1. 27 10月, 2018 7 次提交
  2. 07 9月, 2018 1 次提交
  3. 04 9月, 2018 1 次提交
    • W
      asm-generic/tlb: Track which levels of the page tables have been cleared · a6d60245
      Will Deacon 提交于
      It is common for architectures with hugepage support to require only a
      single TLB invalidation operation per hugepage during unmap(), rather than
      iterating through the mapping at a PAGE_SIZE increment. Currently,
      however, the level in the page table where the unmap() operation occurs
      is not stored in the mmu_gather structure, therefore forcing
      architectures to issue additional TLB invalidation operations or to give
      up and over-invalidate by e.g. invalidating the entire TLB.
      
      Ideally, we could add an interval rbtree to the mmu_gather structure,
      which would allow us to associate the correct mapping granule with the
      various sub-mappings within the range being invalidated. However, this
      is costly in terms of book-keeping and memory management, so instead we
      approximate by keeping track of the page table levels that are cleared
      and provide a means to query the smallest granule required for invalidation.
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      a6d60245
  4. 26 8月, 2018 1 次提交
    • L
      mm/cow: don't bother write protecting already write-protected pages · 1b2de5d0
      Linus Torvalds 提交于
      This is not normally noticeable, but repeated forks are unnecessarily
      expensive because they repeatedly dirty the parent page tables during
      the page table copy operation.
      
      It's trivial to just avoid write protecting the page table entry if it
      was already not writable.
      
      This patch was inspired by
      
          https://bugzilla.kernel.org/show_bug.cgi?id=200447
      
      which points to an ancient "waste time re-doing fork" issue in the
      presence of lots of signals.
      
      That bug was fixed by Eric Biederman's signal handling series
      culminating in commit c3ad2c3b ("signal: Don't restart fork when
      signals come in"), but the unnecessary work for repeated forks is still
      work just fixing, particularly since the fix is trivial.
      
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1b2de5d0
  5. 24 8月, 2018 5 次提交
  6. 23 8月, 2018 1 次提交
  7. 18 8月, 2018 6 次提交
    • R
      Revert "mm: always flush VMA ranges affected by zap_page_range" · 50c150f2
      Rik van Riel 提交于
      There was a bug in Linux that could cause madvise (and mprotect?) system
      calls to return to userspace without the TLB having been flushed for all
      the pages involved.
      
      This could happen when multiple threads of a process made simultaneous
      madvise and/or mprotect calls.
      
      This was noticed in the summer of 2017, at which time two solutions
      were created:
      
        56236a59 ("mm: refactor TLB gathering API")
        99baac21 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
      and
        4647706e ("mm: always flush VMA ranges affected by zap_page_range")
      
      We need only one of these solutions, and the former appears to be a
      little more efficient than the latter, so revert that one.
      
      This reverts 4647706e ("mm: always flush VMA ranges affected by
      zap_page_range")
      
      Link: http://lkml.kernel.org/r/20180706131019.51e3a5f0@imladris.surriel.comSigned-off-by: NRik van Riel <riel@surriel.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      50c150f2
    • M
      memcg, oom: move out_of_memory back to the charge path · 29ef680a
      Michal Hocko 提交于
      Commit 3812c8c8 ("mm: memcg: do not trap chargers with full
      callstack on OOM") has changed the ENOMEM semantic of memcg charges.
      Rather than invoking the oom killer from the charging context it delays
      the oom killer to the page fault path (pagefault_out_of_memory).  This
      in turn means that many users (e.g.  slab or g-u-p) will get ENOMEM when
      the corresponding memcg hits the hard limit and the memcg is is OOM.
      This is behavior is inconsistent with !memcg case where the oom killer
      is invoked from the allocation context and the allocator keeps retrying
      until it succeeds.
      
      The difference in the behavior is user visible.  mmap(MAP_POPULATE)
      might result in not fully populated ranges while the mmap return code
      doesn't tell that to the userspace.  Random syscalls might fail with
      ENOMEM etc.
      
      The primary motivation of the different memcg oom semantic was the
      deadlock avoidance.  Things have changed since then, though.  We have an
      async oom teardown by the oom reaper now and so we do not have to rely
      on the victim to tear down its memory anymore.  Therefore we can return
      to the original semantic as long as the memcg oom killer is not handed
      over to the users space.
      
      There is still one thing to be careful about here though.  If the oom
      killer is not able to make any forward progress - e.g.  because there is
      no eligible task to kill - then we have to bail out of the charge path
      to prevent from same class of deadlocks.  We have basically two options
      here.  Either we fail the charge with ENOMEM or force the charge and
      allow overcharge.  The first option has been considered more harmful
      than useful because rare inconsistencies in the ENOMEM behavior is hard
      to test for and error prone.  Basically the same reason why the page
      allocator doesn't fail allocations under such conditions.  The later
      might allow runaways but those should be really unlikely unless somebody
      misconfigures the system.  E.g.  allowing to migrate tasks away from the
      memcg to a different unlimited memcg with move_charge_at_immigrate
      disabled.
      
      Link: http://lkml.kernel.org/r/20180628151101.25307-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29ef680a
    • H
      mm, huge page: copy target sub-page last when copy huge page · c9f4cd71
      Huang Ying 提交于
      Huge page helps to reduce TLB miss rate, but it has higher cache
      footprint, sometimes this may cause some issue.  For example, when
      copying huge page on x86_64 platform, the cache footprint is 4M.  But on
      a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC
      (last level cache).  That is, in average, there are 2.5M LLC for each
      core and 1.25M LLC for each thread.
      
      If the cache contention is heavy when copying the huge page, and we copy
      the huge page from the begin to the end, it is possible that the begin
      of huge page is evicted from the cache after we finishing copying the
      end of the huge page.  And it is possible for the application to access
      the begin of the huge page after copying the huge page.
      
      In c79b57e4 ("mm: hugetlb: clear target sub-page last when clearing
      huge page"), to keep the cache lines of the target subpage hot, the
      order to clear the subpages in the huge page in clear_huge_page() is
      changed to clearing the subpage which is furthest from the target
      subpage firstly, and the target subpage last.  The similar order
      changing helps huge page copying too.  That is implemented in this
      patch.  Because we have put the order algorithm into a separate
      function, the implementation is quite simple.
      
      The patch is a generic optimization which should benefit quite some
      workloads, not for a specific use case.  To demonstrate the performance
      benefit of the patch, we tested it with vm-scalability run on
      transparent huge page.
      
      With this patch, the throughput increases ~16.6% in vm-scalability
      anon-cow-seq test case with 36 processes on a 2 socket Xeon E5 v3 2699
      system (36 cores, 72 threads).  The test case set
      /sys/kernel/mm/transparent_hugepage/enabled to be always, mmap() a big
      anonymous memory area and populate it, then forked 36 child processes,
      each writes to the anonymous memory area from the begin to the end, so
      cause copy on write.  For each child process, other child processes
      could be seen as other workloads which generate heavy cache pressure.
      At the same time, the IPC (instruction per cycle) increased from 0.63 to
      0.78, and the time spent in user space is reduced ~7.2%.
      
      Link: http://lkml.kernel.org/r/20180524005851.4079-3-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christopher Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9f4cd71
    • H
      mm, clear_huge_page: move order algorithm into a separate function · c6ddfb6c
      Huang Ying 提交于
      Patch series "mm, huge page: Copy target sub-page last when copy huge
      page", v2.
      
      Huge page helps to reduce TLB miss rate, but it has higher cache
      footprint, sometimes this may cause some issue.  For example, when
      copying huge page on x86_64 platform, the cache footprint is 4M.  But on
      a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC
      (last level cache).  That is, in average, there are 2.5M LLC for each
      core and 1.25M LLC for each thread.
      
      If the cache contention is heavy when copying the huge page, and we copy
      the huge page from the begin to the end, it is possible that the begin
      of huge page is evicted from the cache after we finishing copying the
      end of the huge page.  And it is possible for the application to access
      the begin of the huge page after copying the huge page.
      
      In c79b57e4 ("mm: hugetlb: clear target sub-page last when clearing
      huge page"), to keep the cache lines of the target subpage hot, the
      order to clear the subpages in the huge page in clear_huge_page() is
      changed to clearing the subpage which is furthest from the target
      subpage firstly, and the target subpage last.  The similar order
      changing helps huge page copying too.  That is implemented in this
      patchset.
      
      The patchset is a generic optimization which should benefit quite some
      workloads, not for a specific use case.  To demonstrate the performance
      benefit of the patchset, we have tested it with vm-scalability run on
      transparent huge page.
      
      With this patchset, the throughput increases ~16.6% in vm-scalability
      anon-cow-seq test case with 36 processes on a 2 socket Xeon E5 v3 2699
      system (36 cores, 72 threads).  The test case set
      /sys/kernel/mm/transparent_hugepage/enabled to be always, mmap() a big
      anonymous memory area and populate it, then forked 36 child processes,
      each writes to the anonymous memory area from the begin to the end, so
      cause copy on write.  For each child process, other child processes
      could be seen as other workloads which generate heavy cache pressure.
      At the same time, the IPC (instruction per cycle) increased from 0.63 to
      0.78, and the time spent in user space is reduced ~7.2%.
      
      This patch (of 4):
      
      In c79b57e4 ("mm: hugetlb: clear target sub-page last when clearing
      huge page"), to keep the cache lines of the target subpage hot, the
      order to clear the subpages in the huge page in clear_huge_page() is
      changed to clearing the subpage which is furthest from the target
      subpage firstly, and the target subpage last.  This optimization could
      be applied to copying huge page too with the same order algorithm.  To
      avoid code duplication and reduce maintenance overhead, in this patch,
      the order algorithm is moved out of clear_huge_page() into a separate
      function: process_huge_page().  So that we can use it for copying huge
      page too.
      
      This will change the direct calls to clear_user_highpage() into the
      indirect calls.  But with the proper inline support of the compilers,
      the indirect call will be optimized to be the direct call.  Our tests
      show no performance change with the patch.
      
      This patch is a code cleanup without functionality change.
      
      Link: http://lkml.kernel.org/r/20180524005851.4079-2-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Suggested-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christopher Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c6ddfb6c
    • Y
      thp: use mm_file_counter to determine update which rss counter · fadae295
      Yang Shi 提交于
      Since commit eca56ff9 ("mm, shmem: add internal shmem resident
      memory accounting"), MM_SHMEMPAGES is added to separate the shmem
      accounting from regular files.  So, all shmem pages should be accounted
      to MM_SHMEMPAGES instead of MM_FILEPAGES.
      
      And, normal 4K shmem pages have been accounted to MM_SHMEMPAGES, so
      shmem thp pages should be not treated differently.  Account them to
      MM_SHMEMPAGES via mm_counter_file() since shmem pages are swap backed to
      keep consistent with normal 4K shmem pages.
      
      This will not change the rss counter of processes since shmem pages are
      still a part of it.
      
      The /proc/pid/status and /proc/pid/statm counters will however be more
      accurate wrt shmem usage, as originally intended.  And as eca56ff9
      ("mm, shmem: add internal shmem resident memory accounting") mentioned,
      oom also could report more accurate "shmem-rss".
      
      Link: http://lkml.kernel.org/r/1529442518-17398-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fadae295
    • D
      dax: remove VM_MIXEDMAP for fsdax and device dax · e1fb4a08
      Dave Jiang 提交于
      This patch is reworked from an earlier patch that Dan has posted:
      https://patchwork.kernel.org/patch/10131727/
      
      VM_MIXEDMAP is used by dax to direct mm paths like vm_normal_page() that
      the memory page it is dealing with is not typical memory from the linear
      map.  The get_user_pages_fast() path, since it does not resolve the vma,
      is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
      use that as a VM_MIXEDMAP replacement in some locations.  In the cases
      where there is no pte to consult we fallback to using vma_is_dax() to
      detect the VM_MIXEDMAP special case.
      
      Now that we have explicit driver pfn_t-flag opt-in/opt-out for
      get_user_pages() support for DAX we can stop setting VM_MIXEDMAP.  This
      also means we no longer need to worry about safely manipulating vm_flags
      in a future where we support dynamically changing the dax mode of a
      file.
      
      DAX should also now be supported with madvise_behavior(), vma_merge(),
      and copy_page_range().
      
      This patch has been tested against ndctl unit test.  It has also been
      tested against xfstests commit: 625515d using fake pmem created by
      memmap and no additional issues have been observed.
      
      Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.comSigned-off-by: NDave Jiang <dave.jiang@intel.com>
      Acked-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e1fb4a08
  8. 11 8月, 2018 1 次提交
  9. 02 8月, 2018 1 次提交
    • H
      mm: delete historical BUG from zap_pmd_range() · 53406ed1
      Hugh Dickins 提交于
      Delete the old VM_BUG_ON_VMA() from zap_pmd_range(), which asserted
      that mmap_sem must be held when splitting an "anonymous" vma there.
      Whether that's still strictly true nowadays is not entirely clear,
      but the danger of sometimes crashing on the BUG is now fairly clear.
      
      Even with the new stricter rules for anonymous vma marking, the
      condition it checks for can possible trigger. Commit 44960f2a
      ("staging: ashmem: Fix SIGBUS crash when traversing mmaped ashmem
      pages") is good, and originally I thought it was safe from that
      VM_BUG_ON_VMA(), because the /dev/ashmem fd exposed to the user is
      disconnected from the vm_file in the vma, and madvise(,,MADV_REMOVE)
      insists on VM_SHARED.
      
      But after I read John's earlier mail, drawing attention to the
      vfs_fallocate() in there: I may be wrong, and I don't know if Android
      has THP in the config anyway, but it looks to me like an
      unmap_mapping_range() from ashmem's vfs_fallocate() could hit precisely
      the VM_BUG_ON_VMA(), once it's vma_is_anonymous().
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53406ed1
  10. 17 7月, 2018 1 次提交
    • R
      x86/mm/tlb: Leave lazy TLB mode at page table free time · 2ff6ddf1
      Rik van Riel 提交于
      Andy discovered that speculative memory accesses while in lazy
      TLB mode can crash a system, when a CPU tries to dereference a
      speculative access using memory contents that used to be valid
      page table memory, but have since been reused for something else
      and point into la-la land.
      
      The latter problem can be prevented in two ways. The first is to
      always send a TLB shootdown IPI to CPUs in lazy TLB mode, while
      the second one is to only send the TLB shootdown at page table
      freeing time.
      
      The second should result in fewer IPIs, since operationgs like
      mprotect and madvise are very common with some workloads, but
      do not involve page table freeing. Also, on munmap, batching
      of page table freeing covers much larger ranges of virtual
      memory than the batching of unmapped user pages.
      Tested-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NRik van Riel <riel@surriel.com>
      Acked-by: NDave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: efault@gmx.de
      Cc: kernel-team@fb.com
      Cc: luto@kernel.org
      Link: http://lkml.kernel.org/r/20180716190337.26133-3-riel@surriel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2ff6ddf1
  11. 09 7月, 2018 1 次提交
  12. 21 6月, 2018 1 次提交
    • A
      x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings · 42e4089c
      Andi Kleen 提交于
      For L1TF PROT_NONE mappings are protected by inverting the PFN in the page
      table entry. This sets the high bits in the CPU's address space, thus
      making sure to point to not point an unmapped entry to valid cached memory.
      
      Some server system BIOSes put the MMIO mappings high up in the physical
      address space. If such an high mapping was mapped to unprivileged users
      they could attack low memory by setting such a mapping to PROT_NONE. This
      could happen through a special device driver which is not access
      protected. Normal /dev/mem is of course access protected.
      
      To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings.
      
      Valid page mappings are allowed because the system is then unsafe anyways.
      
      It's not expected that users commonly use PROT_NONE on MMIO. But to
      minimize any impact this is only enforced if the mapping actually refers to
      a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip
      the check for root.
      
      For mmaps this is straight forward and can be handled in vm_insert_pfn and
      in remap_pfn_range().
      
      For mprotect it's a bit trickier. At the point where the actual PTEs are
      accessed a lot of state has been changed and it would be difficult to undo
      on an error. Since this is a uncommon case use a separate early page talk
      walk pass for MMIO PROT_NONE mappings that checks for this condition
      early. For non MMIO and non PROT_NONE there are no changes.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: NDave Hansen <dave.hansen@intel.com>
      
      42e4089c
  13. 08 6月, 2018 3 次提交
  14. 01 6月, 2018 1 次提交
  15. 06 4月, 2018 2 次提交
  16. 18 3月, 2018 1 次提交
  17. 17 2月, 2018 1 次提交
    • A
      mm: hide a #warning for COMPILE_TEST · af27d940
      Arnd Bergmann 提交于
      We get a warning about some slow configurations in randconfig kernels:
      
        mm/memory.c:83:2: error: #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid. [-Werror=cpp]
      
      The warning is reasonable by itself, but gets in the way of randconfig
      build testing, so I'm hiding it whenever CONFIG_COMPILE_TEST is set.
      
      The warning was added in 2013 in commit 75980e97 ("mm: fold
      page->_last_nid into page->flags where possible").
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af27d940
  18. 07 2月, 2018 1 次提交
  19. 01 2月, 2018 3 次提交
  20. 20 1月, 2018 1 次提交
    • D
      mm, dax: introduce pfn_t_special() · 785a3fab
      Dan Williams 提交于
      In support of removing the VM_MIXEDMAP indication from DAX VMAs,
      introduce pfn_t_special() for drivers to indicate that _PAGE_SPECIAL
      should be used for DAX ptes. This also helps identify drivers like
      dccssblk that only want to use DAX in a read-only fashion without
      get_user_pages() support.
      
      Ideally we could delete axonram and dcssblk DAX support, but if we need
      to keep it better make it explicit that axonram and dcssblk only support
      a sub-set of DAX due to missing _PAGE_DEVMAP support.
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      785a3fab