1. 15 May, 2019 3 commits
    • J
      mm: introduce put_user_page*(), placeholder versions · fc1d8e7c
      John Hubbard committed
      A discussion of the overall problem is below.
      
      As mentioned in patch 0001, the steps to fix the problem are:
      
      1) Provide put_user_page*() routines, intended to be used
         for releasing pages that were pinned via get_user_pages*().
      
      2) Convert all of the call sites for get_user_pages*(), to
         invoke put_user_page*(), instead of put_page(). This involves dozens of
         call sites, and will take some time.
      
      3) After (2) is complete, use get_user_pages*() and put_user_page*() to
         implement tracking of these pages. This tracking will be separate from
         the existing struct page refcounting.
      
      4) Use the tracking and identification of these pages, to implement
         special handling (especially in writeback paths) when the pages are
         backed by a filesystem.
      
      Overview
      ========
      
      Some kernel components (file systems, device drivers) need to access
      memory that is specified via process virtual address.  For a long time,
      the API to achieve that was get_user_pages ("GUP") and its variations.
      However, GUP has critical limitations that have been overlooked; in
      particular, GUP does not interact correctly with filesystems in all
      situations.  That means that file-backed memory + GUP is a recipe for
      potential problems, some of which have already occurred in the field.
      
      GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem
      code to get the struct page behind a virtual address and to let storage
      hardware perform a direct copy to or from that page.  This is a
      short-lived access pattern, and as such, the window for a concurrent
      writeback of a GUP'd page was small enough that there were not (we think)
      any reported problems.  Also, userspace was expected to understand and
      accept that Direct IO was not synchronized with memory-mapped access to
      that data, nor with any process address space changes such as munmap(),
      mremap(), etc.
      
      Over the years, more GUP uses have appeared (virtualization, device
      drivers, RDMA) that can keep the pages they get via GUP for a long period
      of time (seconds, minutes, hours, days, ...).  This long-term pinning
      makes an underlying design problem more obvious.
      
      In fact, there are a number of key problems inherent to GUP:
      
      Interactions with file systems
      ==============================
      
      File systems expect to be able to write back data, both to reclaim pages,
      and for data integrity.  Allowing other hardware (NICs, GPUs, etc) to gain
      write access to the file memory pages means that such hardware can dirty
      the pages, without the filesystem being aware.  This can, in some cases
      (depending on filesystem, filesystem options, block device, block device
      options, and other variables), lead to data corruption, and also to kernel
      bugs of the form:
      
          kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
          backtrace:
              ext4_writepage
              __writepage
              write_cache_pages
              ext4_writepages
              do_writepages
              __writeback_single_inode
              writeback_sb_inodes
              __writeback_inodes_wb
              wb_writeback
              wb_workfn
              process_one_work
              worker_thread
              kthread
              ret_from_fork
      
      ...which is due to the file system asserting that there are still buffer
      heads attached:
      
              ({                                                      \
                      BUG_ON(!PagePrivate(page));                     \
                      ((struct buffer_head *)page_private(page));     \
              })
      
      Dave Chinner's description of this is very clear:
      
          "The fundamental issue is that ->page_mkwrite must be called on every
          write access to a clean file backed page, not just the first one.
          How long the GUP reference lasts is irrelevant, if the page is clean
          and you need to dirty it, you must call ->page_mkwrite before it is
          marked writeable and dirtied. Every. Time."
      
      This is just one symptom of the larger design problem: real filesystems
      that actually write to a backing device, do not actually support
      get_user_pages() being called on their pages, and letting hardware write
      directly to those pages--even though that pattern has been going on since
      about 2005 or so.
      
      Long term GUP
      =============
      
      Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
      writeable mapping is created), and the pages are file-backed.  That can
      lead to filesystem corruption.  What happens is that when a file-backed
      page is being written back, it is first mapped read-only in all of the CPU
      page tables; the file system then assumes that nobody can write to the
      page, and that the page content is therefore stable.  Unfortunately, the
      GUP callers generally do not monitor changes to the CPU page tables; they
      instead assume that the following pattern is safe (it's not):
      
          get_user_pages()
      
          Hardware can keep a reference to those pages for a very long time,
          and write to it at any time.  Because "hardware" here means "devices
          that are not a CPU", this activity occurs without any interaction with
          the kernel's file system code.
      
          for each page
              set_page_dirty
              put_page()
      
      In fact, the GUP documentation even recommends that pattern.
      
      Anyway, the file system assumes that the page is stable (nothing is
      writing to the page), and that is a problem: stable page content is
      necessary for many filesystem actions during writeback, such as checksum,
      encryption, RAID striping, etc.  Furthermore, filesystem features like COW
      (copy on write) or snapshot also rely on being able to use a new page as
      memory for that memory range inside the file.
      
      Corruption during write back is clearly possible here.  To solve that, one
      idea is to identify pages that have active GUP, so that we can use a
      bounce page to write stable data to the filesystem.  The filesystem would
      work on the bounce page, while any of the active GUP might write to the
      original page.  This would avoid the stable page violation problem, but
      note that it is only part of the overall solution, because other problems
      remain.
      
      Other filesystem features that need to replace the page with a new one can
      be inhibited for pages that are GUP-pinned.  This will, however, alter and
      limit some of those filesystem features.  The only fix for that would be
      to require GUP users to monitor and respond to CPU page table updates.
      Subsystems such as ODP and HMM do this, for example.  This aspect of the
      problem is still under discussion.
      
      Direct IO
      =========
      
      Direct IO can cause corruption, if userspace does Direct-IO that writes to
      a range of virtual addresses that are mmap'd to a file.  The pages written
      to are file-backed pages that can be under write back, while the Direct IO
      is taking place.  Here, Direct IO races with a write back: it calls GUP
      before page_mkclean() has replaced the CPU pte with a read-only entry.
      The race window is pretty small, which is probably why years have gone by
      before we noticed this problem: Direct IO is generally very quick, and
      tends to finish up before the filesystem gets around to doing anything with
      the page contents.  However, it's still a real problem.  The solution is
      to never let GUP return pages that are under write back, but instead,
      force GUP to take a write fault on those pages.  That way, GUP will
      properly synchronize with the active write back.  This does not change the
      required GUP behavior, it just avoids that race.
      
      Details
      =======
      
      Introduces put_user_page(), which simply calls put_page().  This provides
      a way to update all get_user_pages*() callers, so that they call
      put_user_page(), instead of put_page().
      
      Also introduces put_user_pages(), and a few dirty/locked variations, as a
      replacement for release_pages(), and also as a replacement for open-coded
      loops that release multiple pages.  These may be used for subsequent
      performance improvements, via batching of pages to be released.
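      The placeholder behavior described above can be modeled in userspace as
      follows.  This is only a sketch, not the kernel code: struct page here is
      a toy type holding a refcount and a dirty bit, put_page() and
      set_page_dirty() are simplified stand-ins, and only one of the dirty
      variations is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct page (assumption of this sketch). */
struct page {
	int refcount;
	bool dirty;
};

/* Simplified stand-in for the kernel's put_page(): drop one reference. */
static void put_page(struct page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

/* Simplified stand-in for the kernel's set_page_dirty(). */
static void set_page_dirty(struct page *page)
{
	page->dirty = true;
}

/* Placeholder version: simply forwards to put_page(). */
static void put_user_page(struct page *page)
{
	put_page(page);
}

/* Replaces an open-coded "for each page: put_page()" loop. */
static void put_user_pages(struct page **pages, unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++)
		put_user_page(pages[i]);
}

/* Dirty variation: mark each page dirty, then release it. */
static void put_user_pages_dirty(struct page **pages, unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++) {
		set_page_dirty(pages[i]);
		put_user_page(pages[i]);
	}
}
```

      Because every release of a gup-pinned page goes through one entry
      point, a later step can change what put_user_page() does without
      touching the call sites again.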
      
      This is the first step of fixing a problem (also described in [1] and [2])
      with interactions between get_user_pages ("gup") and filesystems.
      
      Problem description: let's start with a bug report.  Below, is what
      happens sometimes, under memory pressure, when a driver pins some pages
      via gup, and then marks those pages dirty, and releases them.  Note that
      the gup documentation actually recommends that pattern.  The problem is
      that the filesystem may do a writeback while the pages were gup-pinned,
      and then the filesystem believes that the pages are clean.  So, when the
      driver later marks the pages as dirty, that conflicts with the
      filesystem's page tracking and results in a BUG(), like this one that I
      experienced:
      
          kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
          backtrace:
              ext4_writepage
              __writepage
              write_cache_pages
              ext4_writepages
              do_writepages
              __writeback_single_inode
              writeback_sb_inodes
              __writeback_inodes_wb
              wb_writeback
              wb_workfn
              process_one_work
              worker_thread
              kthread
              ret_from_fork
      
      ...which is due to the file system asserting that there are still buffer
      heads attached:
      
              ({                                                      \
                      BUG_ON(!PagePrivate(page));                     \
                      ((struct buffer_head *)page_private(page));     \
              })
      
      Dave Chinner's description of this is very clear:
      
          "The fundamental issue is that ->page_mkwrite must be called on
          every write access to a clean file backed page, not just the first
          one.  How long the GUP reference lasts is irrelevant, if the page is
          clean and you need to dirty it, you must call ->page_mkwrite before it
          is marked writeable and dirtied.  Every.  Time."
      
      This is just one symptom of the larger design problem: real filesystems
      that actually write to a backing device, do not actually support
      get_user_pages() being called on their pages, and letting hardware write
      directly to those pages--even though that pattern has been going on since
      about 2005 or so.
      
      The steps to fix it are:
      
      1) (This patch): provide put_user_page*() routines, intended to be used
         for releasing pages that were pinned via get_user_pages*().
      
      2) Convert all of the call sites for get_user_pages*(), to
         invoke put_user_page*(), instead of put_page(). This involves dozens of
         call sites, and will take some time.
      
      3) After (2) is complete, use get_user_pages*() and put_user_page*() to
         implement tracking of these pages. This tracking will be separate from
         the existing struct page refcounting.
      
      4) Use the tracking and identification of these pages, to implement
         special handling (especially in writeback paths) when the pages are
         backed by a filesystem.
      
      [1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
      [2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
      
      Link: http://lkml.kernel.org/r/20190327023632.13307-2-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>		[docs]
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Tested-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • I
      mm/gup: change GUP fast to use flags rather than a write 'bool' · 73b0140b
      Ira Weiny committed
      To facilitate additional options to get_user_pages_fast() change the
      singular write parameter to be gup_flags.
      
      This patch does not change any functionality.  New functionality will
      follow in subsequent patches.
      
      Some of the get_user_pages_fast() call sites were unchanged because they
      already passed FOLL_WRITE or 0 for the write parameter.
      
      NOTE: It was suggested to change the ordering of the get_user_pages_fast()
      arguments to ensure that callers were converted.  This breaks the current
      GUP call site convention of having the returned pages be the final
      parameter.  So the suggestion was rejected.
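      The mapping from the old boolean to the new flags word can be sketched
      as below.  Both helper functions are hypothetical and exist only for
      illustration; the FOLL_WRITE value of 0x01 is an assumption of this
      sketch (the kernel defines the real one in its mm headers).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed flag value for this sketch. */
#define FOLL_WRITE 0x01u

/* Hypothetical helper: map the old boolean 'write' argument to gup_flags. */
static unsigned int write_to_gup_flags(bool write)
{
	return write ? FOLL_WRITE : 0;
}

/* Hypothetical helper: recover the old 'write' meaning from gup_flags. */
static bool gup_flags_want_write(unsigned int gup_flags)
{
	return (gup_flags & FOLL_WRITE) != 0;
}
```

      With a flags word, further options can be OR-ed in later without
      changing the function signature again.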
      
      Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Mike Marshall <hubcap@omnibond.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • I
      mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM · 932f4a63
      Ira Weiny committed
      Patch series "Add FOLL_LONGTERM to GUP fast and use it".
      
      HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
      advantages.  These pages can be held for a significant time.  But
      get_user_pages_fast() does not protect against mapping FS DAX pages.
      
      Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
      retains the performance while also adding the FS DAX checks.  XDP has also
      shown interest in using this functionality.[1]
      
      In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
      and remove the specialized get_user_pages_longterm call.
      
      [1] https://lkml.org/lkml/2019/3/19/939
      
      "longterm" is a relative thing and at this point is probably a misnomer.
      This is really flagging a pin which is going to be given to hardware and
      can't move.  I've thought of a couple of alternative names but I think we
      have to settle on if we are going to use FL_LAYOUT or something else to
      solve the "longterm" problem.  Then I think we can change the flag to a
      better name.
      
      Secondly, it depends on how often you are registering memory.  I have
      spoken with some RDMA users who consider MR in the performance path...
      For the overall application performance.  I don't have the numbers as the
      tests for HFI1 were done a long time ago.  But there was a significant
      advantage.  Some of which is probably due to the fact that you don't have
      to hold mmap_sem.
      
      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
      the use of *_fast (they are reworking the use of mmap_sem) and as soon as they
      are accepted I'll submit a patch to convert the RDMA core as well.  Also
      to this point others are looking to use *_fast.
      
      As an aside, Jason pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      This patch (of 7):
      
      This patch starts a series which aims to support FOLL_LONGTERM in
      get_user_pages_fast().  Some callers would like to do a longterm (user
      controlled) pin of pages with the fast variant of GUP for performance
      purposes.
      
      Rather than have a separate get_user_pages_longterm() call, introduce
      FOLL_LONGTERM and change the longterm callers to use it.
      
      This patch does not change any functionality.  In the short term
      "longterm" or user controlled pins are unsafe for Filesystems and FS DAX
      in particular has been blocked.  However, callers of get_user_pages_fast()
      were not "protected".
      
      FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
      requires vmas to determine if DAX is in use.
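      The dependency on vmas can be illustrated with a small userspace model.
      Everything here is an assumption made for illustration: struct toy_vma,
      toy_check_longterm(), the flag values, and the choice of -EOPNOTSUPP as
      the error are not taken from the patch itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed flag values for this sketch; the kernel defines the real ones. */
#define FOLL_WRITE    0x01u
#define FOLL_LONGTERM 0x10000u

/* Toy stand-in for a vma: only records whether it maps FS DAX. */
struct toy_vma {
	bool is_fsdax;
};

/*
 * Hypothetical check: a long-term pin is refused if any vma in the range
 * is backed by FS DAX.  Pins without FOLL_LONGTERM are not restricted.
 */
static int toy_check_longterm(unsigned int gup_flags,
			      const struct toy_vma *vmas, size_t nr_vmas)
{
	size_t i;

	if (!(gup_flags & FOLL_LONGTERM))
		return 0;

	for (i = 0; i < nr_vmas; i++)
		if (vmas[i].is_fsdax)
			return -EOPNOTSUPP;

	return 0;
}
```

      The point of the sketch is that the check needs per-vma information, so
      a caller that never looks up the vmas has nothing to check against.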
      
      NOTE: In merging with the CMA changes we opt to change the
      get_user_pages() call in check_and_migrate_cma_pages() to a call of
      __get_user_pages_locked() on the newly migrated pages.  This makes the
      code read better in that we are calling __get_user_pages_locked() on the
      pages before and after a potential migration.
      
      As a side effect some of the interfaces are cleaned up but this is not the
      primary purpose of the series.
      
      In review[1] it was asked:
      
      <quote>
      > This I don't get - if you do lock down long term mappings performance
      > of the actual get_user_pages call shouldn't matter to start with.
      >
      > What do I miss?
      
      A couple of points.
      
      First "longterm" is a relative thing and at this point is probably a
      misnomer.  This is really flagging a pin which is going to be given to
      hardware and can't move.  I've thought of a couple of alternative names
      but I think we have to settle on if we are going to use FL_LAYOUT or
      something else to solve the "longterm" problem.  Then I think we can
      change the flag to a better name.
      
      Second, it depends on how often you are registering memory.  I have spoken
      with some RDMA users who consider MR in the performance path...  For the
      overall application performance.  I don't have the numbers as the tests
      for HFI1 were done a long time ago.  But there was a significant
      advantage.  Some of which is probably due to the fact that you don't have
      to hold mmap_sem.
      
      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
      the use of *_fast (they are reworking the use of mmap_sem) and as soon as they
      are accepted I'll submit a patch to convert the RDMA core as well.  Also
      to this point others are looking to use *_fast.
      
      As an aside, Jason pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      </quote>
      
      [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965
      
      [ira.weiny@intel.com: v3]
        Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 30 April, 2019 1 commit
    • R
      mm/hibernation: Make hibernation handle unmapped pages · d6332692
      Rick Edgecombe committed
      Make hibernate handle unmapped pages on the direct map when
      CONFIG_ARCH_HAS_SET_ALIAS=y is set. These functions allow for setting pages
      to invalid configurations, so now hibernate should check if the pages have
      valid mappings and handle if they are unmapped when doing a hibernate
      save operation.
      
      Previously this checking was already done when CONFIG_DEBUG_PAGEALLOC=y
      was configured. It does not appear to have a big hibernating performance
      impact. The speed of the saving operation before this change was measured
      as 819.02 MB/s, and after was measured at 813.32 MB/s.
      
      Before:
      [    4.670938] PM: Wrote 171996 kbytes in 0.21 seconds (819.02 MB/s)
      
      After:
      [    4.504714] PM: Wrote 178932 kbytes in 0.22 seconds (813.32 MB/s)
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Cc: <akpm@linux-foundation.org>
      Cc: <ard.biesheuvel@linaro.org>
      Cc: <deneen.t.dock@intel.com>
      Cc: <kernel-hardening@lists.openwall.com>
      Cc: <kristen@linux.intel.com>
      Cc: <linux_dti@icloud.com>
      Cc: <will.deacon@arm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190426001143.4983-16-namit@vmware.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 15 April, 2019 2 commits
    • L
      mm: add 'try_get_page()' helper function · 88b1a17d
      Linus Torvalds committed
      This is the same as the traditional 'get_page()' function, but instead
      of unconditionally incrementing the reference count of the page, it only
      does so if the count was "safe".  It returns whether the reference count
      was incremented (and is marked __must_check, since the caller obviously
      has to be aware of it).
      
      Also like 'get_page()', you can't use this function unless you already
      had a reference to the page.  The intent is that you can use this
      exactly like get_page(), but in situations where you want to limit the
      maximum reference count.
      
      The code currently does an unconditional WARN_ON_ONCE() if we ever hit
      the reference count issues (either zero or negative), as a notification
      that the conditional non-increment actually happened.
      
      NOTE! The count access for the "safety" check is inherently racy, but
      that doesn't matter since the buffer we use is basically half the range
      of the reference count (ie we look at the sign of the count).
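      A minimal userspace model of the behavior described above.  The struct
      page here is a toy type with a plain int refcount; the real helper
      operates on the kernel's struct page and issues a WARN_ON_ONCE() where
      this sketch only returns false.  The warn_unused_result attribute
      (GCC/Clang) stands in for __must_check.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the kernel's struct page (assumption of this sketch). */
struct page {
	int refcount;
};

/*
 * Take a reference only if the count is "safe", i.e. strictly positive.
 * A zero count means the caller had no reference to begin with; a negative
 * count means the counter has overflowed into the sign bit.
 */
__attribute__((warn_unused_result))
static bool try_get_page(struct page *page)
{
	if (page->refcount <= 0)	/* the kernel warns once here */
		return false;

	page->refcount++;
	return true;
}
```

      Callers are expected to check the result and back out if the reference
      was not taken, instead of assuming success as with get_page().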
      Acked-by: Matthew Wilcox <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      mm: make page ref count overflow check tighter and more explicit · f958d7b5
      Linus Torvalds committed
      We have a VM_BUG_ON() to check that the page reference count doesn't
      underflow (or get close to overflow) by checking the sign of the count.
      
      That's all fine, but we actually want to allow people to use a "get page
      ref unless it's already very high" helper function, and we want that one
      to use the sign of the page ref (without triggering this VM_BUG_ON).
      
      Change the VM_BUG_ON to only check for small underflows (or _very_ close
      to overflowing), and ignore overflows which have strayed into negative
      territory.
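      The difference between the two checks can be sketched with plain
      integers.  The window of 127 counts used below is an assumption of this
      sketch, and both function names are hypothetical; the unsigned
      wrap-around trick is one compact way to test for a small range ending
      at zero.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Old-style check: any zero or negative count trips the assertion. */
static bool old_check_trips(int count)
{
	return count <= 0;
}

/*
 * Tighter check: trips only for counts in [-127, 0].  Adding 127 as an
 * unsigned value maps that range onto [0, 127]; every other count lands
 * above 127 after wrap-around.
 */
static bool new_check_trips(int count)
{
	return (unsigned int)count + 127u <= 127u;
}
```

      Counts far into negative territory (an overflow that has strayed past
      the sign bit) no longer trip the tighter check, which leaves room for a
      helper that inspects the sign of the count itself.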
      Acked-by: Matthew Wilcox <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 13 March, 2019 1 commit
  5. 08 March, 2019 1 commit
  6. 06 March, 2019 1 commit
    • A
      mm: update get_user_pages_longterm to migrate pages allocated from CMA region · 9a4e9f3b
      Aneesh Kumar K.V committed
      This patch updates get_user_pages_longterm to migrate pages allocated
      out of CMA region.  This makes sure that we don't keep non-movable pages
      (due to page reference count) in the CMA area.
      
      This will be used by ppc64 in a later patch to avoid pinning pages in
      the CMA region.  ppc64 uses CMA region for allocation of the hardware
      page table (hash page table) and not being able to migrate pages out of
      the CMA region results in page table allocation failures.
      
      One case where we hit this easily is when a guest uses a VFIO passthrough
      device.  VFIO locks all the guest's memory and if the guest memory is
      backed by CMA region, it becomes unmovable resulting in fragmenting the
      CMA and possibly preventing other guests from allocating a large enough
      hash page table.
      
      NOTE: We allocate the new page without using __GFP_THISNODE
      
      Link: http://lkml.kernel.org/r/20190114095438.32470-3-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 05 January, 2019 2 commits
    • N
      fs: don't open code lru_to_page() · f86196ea
      Nikolay Borisov committed
      Multiple filesystems open code lru_to_page().  Rectify this by moving
      the macro from mm_inline (which is specific to lru stuff) to the more
      generic mm.h header and start using the macro where appropriate.
      
      No functional changes.
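      For readers unfamiliar with the macro, here is a userspace sketch of
      what lru_to_page() does: it takes a list head and returns the page at
      the tail of that list.  The struct definitions, container_of() and
      list_add() below are simplified stand-ins written for this sketch, not
      the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal doubly linked list node, modeled on the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Toy stand-in for struct page: an index plus its LRU list linkage. */
struct page {
	unsigned long index;
	struct list_head lru;
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* The page at the tail of the list (the entry before the head). */
#define lru_to_page(head) container_of((head)->prev, struct page, lru)

/* Insert 'entry' right after 'head', i.e. at the front of the list. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}
```

      Open-coded versions typically spell out the same container_of() or
      list_entry() expression on head->prev at each call site.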
      
      Link: http://lkml.kernel.org/r/20181129104810.23361-1-nborisov@suse.com
      Link: https://lkml.kernel.org/r/20181129075301.29087-1-nborisov@suse.com
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Pankaj gupta <pagupta@redhat.com>
      Acked-by: "Yan, Zheng" <zyan@redhat.com>		[ceph]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: treewide: remove unused address argument from pte_alloc functions · 4cf58924
      Joel Fernandes (Google) committed
      Patch series "Add support for fast mremap".
      
      This series speeds up the mremap(2) syscall by copying page tables at
      the PMD level even for non-THP systems.  There is concern that the extra
      'address' argument that mremap passes to pte_alloc may do something
      subtle architecture related in the future that may make the scheme not
      work.  Also we find that there is no point in passing the 'address' to
      pte_alloc since it's unused.  This patch therefore removes this argument
      tree-wide resulting in a nice negative diff as well.  It also ensures
      along the way that the enabled architectures do not do anything funky
      with the 'address' argument that goes unnoticed by the optimization.
      
      Build and boot tested on x86-64.  Build tested on arm64.  The config
      enablement patch for arm64 will be posted in the future after more
      testing.
      
      The changes were obtained by applying the following Coccinelle script.
      (thanks Julia for answering all Coccinelle questions!).
      Following fix ups were done manually:
      * Removal of address argument from  pte_fragment_alloc
      * Removal of pte_alloc_one_fast definitions from m68k and microblaze.
      
      // Options: --include-headers --no-includes
      // Note: I split the 'identifier fn' line, so if you are manually
      // running it, please unsplit it so it runs for you.
      
      virtual patch
      
      @pte_alloc_func_def depends on patch exists@
      identifier E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      type T2;
      @@
      
       fn(...
      - , T2 E2
       )
       { ... }
      
      @pte_alloc_func_proto_noarg depends on patch exists@
      type T1, T2, T3, T4;
      identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1, T2);
      + T3 fn(T1);
      |
      - T3 fn(T1, T2, T4);
      + T3 fn(T1, T2);
      )
      
      @pte_alloc_func_proto depends on patch exists@
      identifier E1, E2, E4;
      type T1, T2, T3, T4;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1 E1, T2 E2);
      + T3 fn(T1 E1);
      |
      - T3 fn(T1 E1, T2 E2, T4 E4);
      + T3 fn(T1 E1, T2 E2);
      )
      
      @pte_alloc_func_call depends on patch exists@
      expression E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
       fn(...
      -,  E2
       )
      
      @pte_alloc_macro depends on patch exists@
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      identifier a, b, c;
      expression e;
      position p;
      @@
      
      (
      - #define fn(a, b, c) e
      + #define fn(a, b) e
      |
      - #define fn(a, b) e
      + #define fn(a) e
      )
      
      Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4cf58924
  8. 29 Dec, 2018 6 commits
  9. 02 Nov, 2018 1 commit
    • M
      mm: add mm_pxd_folded checks to pgtable_bytes accounting functions · 6d212db1
      Martin Schwidefsky committed
      The common mm code calls mm_dec_nr_pmds() and mm_dec_nr_puds()
      in free_pgtables() if the address range spans a full pud or pmd.
      If mm_dec_nr_puds/mm_dec_nr_pmds are non-empty due to configuration
      settings, they blindly subtract the size of the pmd or pud table from
      pgtable_bytes even if the pud or pmd page table layer is folded.
      
      Add explicit mm_[pmd|pud]_folded checks to the four pgtable_bytes
      accounting functions mm_inc_nr_puds, mm_inc_nr_pmds, mm_dec_nr_puds
      and mm_dec_nr_pmds. As the check for folded page tables can be
      overridden by the architecture, this keeps the pgtable_bytes value
      correct for platforms that use a dynamic number of page table levels.
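      
      The effect of the folded check can be sketched in user space.  This is
      only a toy model, not the kernel code: struct toy_mm, the pmd_folded
      flag, TOY_PMD_TABLE_SIZE and the toy_* helpers are all hypothetical
      stand-ins for the real mm_struct accounting.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the pgtable_bytes counter kept per mm (hypothetical). */
struct toy_mm {
	long pgtable_bytes;
	bool pmd_folded;	/* assumed flag: true if the pmd level is folded */
};

#define TOY_PMD_TABLE_SIZE 4096L	/* assumed size of one pmd table */

/* Stand-in for an architecture-overridable "is this level folded?" check. */
static bool toy_mm_pmd_folded(const struct toy_mm *mm)
{
	return mm->pmd_folded;
}

/* Account one pmd table, unless the pmd level does not really exist. */
static void toy_mm_inc_nr_pmds(struct toy_mm *mm)
{
	if (toy_mm_pmd_folded(mm))
		return;
	mm->pgtable_bytes += TOY_PMD_TABLE_SIZE;
}

/* Un-account one pmd table, with the same folded check. */
static void toy_mm_dec_nr_pmds(struct toy_mm *mm)
{
	if (toy_mm_pmd_folded(mm))
		return;
	mm->pgtable_bytes -= TOY_PMD_TABLE_SIZE;
}
```

      With the check in place, calling toy_mm_dec_nr_pmds() on a toy_mm whose
      pmd level is folded leaves pgtable_bytes untouched instead of driving
      it negative.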
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  10. 31 Oct, 2018 1 commit
  11. 27 Oct, 2018 7 commits
  12. 11 Oct, 2018 1 commit
    • L
      PCI/P2PDMA: Support peer-to-peer memory · 52916982
      Logan Gunthorpe committed
      Some PCI devices may have memory mapped in a BAR space that's intended for
      use in peer-to-peer transactions.  To enable such transactions the memory
      must be registered with ZONE_DEVICE pages so it can be used by DMA
      interfaces in existing drivers.
      
      Add an interface for other subsystems to find and allocate chunks of P2P
      memory as necessary to facilitate transfers between two PCI peers:
      
        struct pci_dev *pci_p2pmem_find[_many]();
        int pci_p2pdma_distance[_many]();
        void *pci_alloc_p2pmem();
      
      The new interface requires a driver to collect a list of client devices
      involved in the transaction then call pci_p2pmem_find() to obtain any
      suitable P2P memory.  Alternatively, if the caller knows a device which
      provides P2P memory, they can use pci_p2pdma_distance() to determine if it
      is usable.  With a suitable p2pmem device, memory can then be allocated
      with pci_alloc_p2pmem() for use in DMA transactions.
      
      Depending on hardware, using peer-to-peer memory may reduce the bandwidth
      of the transfer but can significantly reduce pressure on system memory.
      This may be desirable in many cases: for example, a system could be
      designed with a small CPU connected to a PCIe switch by a small number of
      lanes, which would maximize the number of lanes available to connect to
      NVMe devices.
      
      The code is designed to only utilize the p2pmem device if all the devices
      involved in a transfer are behind the same PCI bridge.  This is because we
      have no way of knowing whether peer-to-peer routing between PCIe Root Ports
      is supported (PCIe r4.0, sec 1.3.1).  Additionally, the benefits of P2P
      transfers that go through the RC are limited to reducing DRAM usage
      and, in some cases, coding convenience.  The PCI-SIG may be exploring
      adding a new capability bit to advertise whether this is possible for
      future hardware.
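      
      The "same PCI bridge" restriction can be pictured with a toy topology
      walk.  This is a user-space sketch under simplifying assumptions, not
      the kernel implementation: struct toy_dev and the toy_* helpers are
      hypothetical, the root complex is modeled as the single node without an
      upstream pointer, and the check is done pairwise between the provider
      and each client.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy topology node (hypothetical); 'upstream' is NULL for the root complex. */
struct toy_dev {
	const char *name;
	const struct toy_dev *upstream;
};

/* True if 'anc' is 'dev' itself or one of the bridges upstream of it. */
static bool toy_is_upstream_of(const struct toy_dev *anc,
			       const struct toy_dev *dev)
{
	for (; dev; dev = dev->upstream)
		if (dev == anc)
			return true;
	return false;
}

/*
 * Closest bridge that both devices sit behind, ignoring the root complex.
 * NULL means the only common point is the root complex, i.e. the transfer
 * would have to be routed between root ports.
 */
static const struct toy_dev *toy_common_bridge(const struct toy_dev *a,
					       const struct toy_dev *b)
{
	const struct toy_dev *p;

	for (p = a->upstream; p && p->upstream; p = p->upstream)
		if (toy_is_upstream_of(p, b))
			return p;
	return NULL;
}

/* The provider is usable only if every client shares a bridge with it. */
static bool toy_p2pmem_usable(const struct toy_dev *provider,
			      const struct toy_dev *const *clients, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!toy_common_bridge(provider, clients[i]))
			return false;
	return true;
}
```

      In this model, two endpoints below the same switch share that switch as
      their common bridge, while endpoints under different root ports share
      nothing but the root complex and are rejected.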
      
      This commit includes significant rework and feedback from Christoph
      Hellwig.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      [bhelgaas: fold in fix from Keith Busch <keith.busch@intel.com>:
      https://lore.kernel.org/linux-pci/20181012155920.15418-1-keith.busch@intel.com,
      to address comment from Dan Carpenter <dan.carpenter@oracle.com>, fold in
      https://lore.kernel.org/linux-pci/20181017160510.17926-1-logang@deltatee.com]
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
  13. 06 Oct, 2018 1 commit
    • M
      mm: migration: fix migration of huge PMD shared pages · 017b1660
      Mike Kravetz committed
      The page migration code employs try_to_unmap() to try and unmap the source
      page.  This is accomplished by using rmap_walk to find all vmas where the
      page is mapped.  This search stops when page mapcount is zero.  For shared
      PMD huge pages, the page map count is always 1 no matter the number of
      mappings.  Shared mappings are tracked via the reference count of the PMD
      page.  Therefore, try_to_unmap stops prematurely and does not completely
      unmap all mappings of the source page.
      
      This problem can result in data corruption, as writes to the original
      source page can happen after the contents of the page are copied to the
      target page.  Hence, data is lost.
      
      This problem was originally seen as DB corruption of shared global areas
      after a huge page was soft offlined due to ECC memory errors.  DB
      developers noticed they could reproduce the issue by (hotplug) offlining
      memory used to back huge pages.  A simple testcase can reproduce the
      problem by creating a shared PMD mapping (note that this must be at least
      PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
      migrate_pages() to migrate process pages between nodes while continually
      writing to the huge pages being migrated.
      
      To fix, have the try_to_unmap_one routine check for huge PMD sharing by
      calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a shared
      mapping it will be 'unshared' which removes the page table entry and drops
      the reference on the PMD page.  After this, flush caches and TLB.
      
      mmu notifiers are called before locking page tables, but we can not be
      sure of PMD sharing until page tables are locked.  Therefore, check for
      the possibility of PMD sharing before locking so that notifiers can
      prepare for the worst possible case.
      
      Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
      [mike.kravetz@oracle.com: make _range_in_vma() a static inline]
        Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
      Fixes: 39dde65c ("shared page table for hugetlb page")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  14. 24 Aug, 2018 1 commit
  15. 23 Aug, 2018 3 commits
  16. 18 Aug, 2018 4 commits
    • P
      mm/sparse: delete old sparse_init and enable new one · 2a3cb8ba
      Pavel Tatashin committed
      Rename new_sparse_init() to sparse_init(), which enables it.  Delete the
      old sparse_init() and all the code that became obsolete with it.
      
      [pasha.tatashin@oracle.com: remove unused sparse_mem_maps_populate_node()]
        Link: http://lkml.kernel.org/r/20180716174447.14529-6-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20180712203730.8703-6-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
      Tested-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      mm/sparse: move buffer init/fini to the common place · afda57bc
      Pavel Tatashin committed
      Now that both variants of sparse memory use the same buffers to populate
      memory map, we can move sparse_buffer_init()/sparse_buffer_fini() to the
      common place.
      
      Link: http://lkml.kernel.org/r/20180712203730.8703-4-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
      Tested-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      mm/sparse: abstract sparse buffer allocations · 35fd1eb1
      Pavel Tatashin committed
      Patch series "sparse_init rewrite", v6.
      
      In sparse_init() we allocate two large buffers to temporarily hold usemap
      and memmap for the whole machine.  However, we can avoid doing that if
      we change sparse_init() to operate on a per-node basis instead of doing
      it on the whole machine beforehand.
      
      As shown by Baoquan
        http://lkml.kernel.org/r/20180628062857.29658-1-bhe@redhat.com
      
      the buffers are large enough to stop a machine from booting on small
      memory systems.
      
      Another benefit of these changes is that they also obsolete
      CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER.
      
      This patch (of 5):
      
      When struct pages are allocated for sparse-vmemmap VA layout, we first try
      to allocate one large buffer, and then, if that fails, allocate struct
      pages for each section as we go.
      
      The code that allocates the buffer uses global variables and is spread
      across several call sites.
      
      Clean up the code by introducing three functions to handle the global
      buffer:
      
      sparse_buffer_init()	initialize the buffer
      sparse_buffer_fini()	free the remaining part of the buffer
      sparse_buffer_alloc()	alloc from the buffer, and if buffer is empty
      			return NULL
      
      Define these functions in sparse.c instead of sparse-vmemmap.c because
      later we will use them for non-vmemmap sparse allocations as well.
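      
      The init/alloc/fini trio behaves like a simple bump allocator.  The
      following is a user-space sketch of that idea only, not the code in
      sparse.c: the toy_* names, the static backing array and the alignment
      parameter are assumptions, and toy_buffer_fini() merely reports the size
      of the unused tail instead of really freeing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Backing store for the toy buffer (stands in for one large boot-time allocation). */
static unsigned char toy_backing[4096];
static unsigned char *toy_buf_cur;	/* next free byte, NULL if no buffer */
static unsigned char *toy_buf_end;	/* one past the last usable byte */

/* Set up the global buffer with 'size' usable bytes. */
static void toy_buffer_init(size_t size)
{
	assert(size <= sizeof(toy_backing));
	toy_buf_cur = toy_backing;
	toy_buf_end = toy_backing + size;
}

/*
 * Carve 'size' bytes, aligned to 'align' (a power of two), out of the
 * buffer.  Returns NULL when the buffer is missing or exhausted, so the
 * caller can fall back to a per-section allocation.
 */
static void *toy_buffer_alloc(size_t size, size_t align)
{
	uintptr_t start;

	if (!toy_buf_cur)
		return NULL;
	start = ((uintptr_t)toy_buf_cur + align - 1) & ~((uintptr_t)align - 1);
	if (start > (uintptr_t)toy_buf_end ||
	    size > (uintptr_t)toy_buf_end - start)
		return NULL;
	toy_buf_cur = (unsigned char *)start + size;
	return (void *)start;
}

/* Close the buffer and report how many bytes at its tail were never used. */
static size_t toy_buffer_fini(void)
{
	size_t unused = toy_buf_cur ? (size_t)(toy_buf_end - toy_buf_cur) : 0;

	toy_buf_cur = NULL;
	toy_buf_end = NULL;
	return unused;
}
```

      A caller would initialize the buffer once, request chunks until
      toy_buffer_alloc() returns NULL, and then call toy_buffer_fini() to
      give back whatever remains.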
      
      [akpm@linux-foundation.org: use PTR_ALIGN()]
      [akpm@linux-foundation.org: s/BUG_ON/WARN_ON/]
      Link: http://lkml.kernel.org/r/20180712203730.8703-2-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: Oscar Salvador <osalvador@suse.de>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      mm, huge page: copy target sub-page last when copy huge page · c9f4cd71
      Huang Ying committed
      Huge pages help to reduce the TLB miss rate, but they have a higher
      cache footprint, which can sometimes cause problems.  For example, when
      copying a huge page on the x86_64 platform, the cache footprint is 4M.
      But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only
      45M LLC (last level cache).  That is, on average, there is 2.5M LLC for
      each core and 1.25M LLC for each thread.
      
      If the cache contention is heavy when copying the huge page, and we copy
      the huge page from the beginning to the end, it is possible that the
      beginning of the huge page is evicted from the cache after we finish
      copying the end of the huge page.  And it is possible for the application
      to access the beginning of the huge page after copying the huge page.
      
      In c79b57e4 ("mm: hugetlb: clear target sub-page last when clearing
      huge page"), to keep the cache lines of the target subpage hot, the
      order in which clear_huge_page() clears the subpages of the huge page
      was changed to clear the subpage furthest from the target subpage
      first, and the target subpage last.  A similar change of order helps
      huge page copying too, and is implemented in this patch.  Because we
      have put the order algorithm into a separate function, the
      implementation is quite simple.
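      
      One way to produce a "furthest subpage first, target subpage last"
      order is sketched below.  This is an illustrative user-space version
      only; toy_copy_order() is a hypothetical helper and is not claimed to
      match the order produced by the kernel's own ordering function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill order[0..nr-1] with subpage indices so that the subpage furthest
 * from 'target' comes first and 'target' itself comes last.  Two cursors
 * close in on the target from both ends of the huge page.
 */
static void toy_copy_order(size_t nr, size_t target, size_t *order)
{
	size_t lo = 0, hi = nr - 1, k = 0;

	assert(nr > 0 && target < nr);
	while (lo < hi) {
		/* Emit whichever end is currently further from the target. */
		if (target - lo >= hi - target)
			order[k++] = lo++;
		else
			order[k++] = hi--;
	}
	order[k] = lo;	/* both cursors have met at the target */
}
```

      For 8 subpages with target index 2, this yields the order
      7, 6, 5, 0, 4, 1, 3, 2, so the subpage the application is about to
      touch is the one copied most recently.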
      
      The patch is a generic optimization which should benefit quite a few
      workloads, rather than a specific use case.  To demonstrate the
      performance benefit of the patch, we tested it with vm-scalability run
      on transparent huge pages.
      
      With this patch, the throughput increases ~16.6% in the vm-scalability
      anon-cow-seq test case with 36 processes on a 2 socket Xeon E5 v3 2699
      system (36 cores, 72 threads).  The test case sets
      /sys/kernel/mm/transparent_hugepage/enabled to always, mmap()s a big
      anonymous memory area and populates it, then forks 36 child processes,
      each of which writes to the anonymous memory area from the beginning to
      the end, causing copy on write.  For each child process, the other child
      processes can be seen as other workloads which generate heavy cache
      pressure.  At the same time, the IPC (instructions per cycle) increased
      from 0.63 to 0.78, and the time spent in user space was reduced by ~7.2%.
      
      Link: http://lkml.kernel.org/r/20180524005851.4079-3-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christopher Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 02 Aug, 2018 1 commit
    • L
      mm: do not initialize TLB stack vma's with vma_init() · 8b11ec1b
      Linus Torvalds committed
      Commit 2c4541e2 ("mm: use vma_init() to initialize VMAs on stack and
      data segments") tried to initialize various left-over ad-hoc vma's
      "properly", but actually made things worse for the temporary vma's used
      for TLB flushing.
      
      vma_init() doesn't actually initialize all of the vma, just a few
      fields, so doing something like
      
         -       struct vm_area_struct vma = { .vm_mm = tlb->mm, };
         +       struct vm_area_struct vma;
         +
         +       vma_init(&vma, tlb->mm);
      
      was actually very bad: instead of having a nicely initialized vma with
      every field but "vm_mm" zeroed, you'd have an entirely uninitialized vma
      with only a couple of fields initialized.  And they weren't even fields
      that the code in question mostly cared about.
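      
      The difference between the two initialization styles can be reproduced
      with a trimmed-down structure.  This is a user-space sketch: struct
      toy_vma, struct toy_mm and both helpers are hypothetical, and the
      partial initializer sets only one field purely for illustration (the
      real vma_init() sets its own small set of fields).

```c
#include <assert.h>
#include <string.h>

struct toy_mm { int id; };

/* Hypothetical, heavily trimmed stand-in for struct vm_area_struct. */
struct toy_vma {
	struct toy_mm *vm_mm;
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_flags;
};

/* Designated initializer: vm_mm is set and every other member is zeroed. */
static struct toy_vma toy_vma_zeroed(struct toy_mm *mm)
{
	struct toy_vma vma = { .vm_mm = mm, };

	return vma;
}

/*
 * Partial initialization: only the listed field is written, everything
 * else keeps whatever bytes the storage held before the call.
 */
static void toy_vma_partial_init(struct toy_vma *vma, struct toy_mm *mm)
{
	vma->vm_mm = mm;
}
```

      Filling a struct toy_vma with a 0xAA byte pattern to mimic stale stack
      contents and then calling toy_vma_partial_init() leaves vm_flags
      holding that pattern, whereas toy_vma_zeroed() always returns zeroed
      vm_start, vm_end and vm_flags.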
      
      The flush_tlb_range() function takes a "struct vma" rather than a
      "struct mm_struct", because a few architectures actually care about what
      kind of range it is - being able to only do an ITLB flush if it's a
      range that doesn't have data accesses enabled, for example.  And all the
      normal users already have the vma for doing the range invalidation.
      
      But a few people want to call flush_tlb_range() with a range they just
      made up, so they also end up using a made-up vma.  x86 just has a
      special "flush_tlb_mm_range()" function for this, but other
      architectures (arm and ia64) do the "use fake vma" thing instead, and
      thus got caught up in the vma_init() changes.
      
      At the same time, the TLB flushing code really doesn't care about most
      other fields in the vma, so vma_init() is just unnecessary and
      pointless.
      
      This fixes things by having an explicit "this is just an initializer for
      the TLB flush" initializer macro, which is used by the arm/arm64/ia64
      people who mis-use this interface with just a dummy vma.
      
      Fixes: 2c4541e2 ("mm: use vma_init() to initialize VMAs on stack and data segments")
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 27 Jul, 2018 2 commits
    • K
      mm: fix vma_is_anonymous() false-positives · bfd40eaf
      Kirill A. Shutemov committed
      vma_is_anonymous() relies on ->vm_ops being NULL to detect anonymous
      VMA.  This is unreliable as ->mmap may not set ->vm_ops.
      
      False-positive vma_is_anonymous() may lead to crashes:
      
      	next ffff8801ce5e7040 prev ffff8801d20eca50 mm ffff88019c1e13c0
      	prot 27 anon_vma ffff88019680cdd8 vm_ops 0000000000000000
      	pgoff 0 file ffff8801b2ec2d00 private_data 0000000000000000
      	flags: 0xff(read|write|exec|shared|mayread|maywrite|mayexec|mayshare)
      	------------[ cut here ]------------
      	kernel BUG at mm/memory.c:1422!
      	invalid opcode: 0000 [#1] SMP KASAN
      	CPU: 0 PID: 18486 Comm: syz-executor3 Not tainted 4.18.0-rc3+ #136
      	Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
      	01/01/2011
      	RIP: 0010:zap_pmd_range mm/memory.c:1421 [inline]
      	RIP: 0010:zap_pud_range mm/memory.c:1466 [inline]
      	RIP: 0010:zap_p4d_range mm/memory.c:1487 [inline]
      	RIP: 0010:unmap_page_range+0x1c18/0x2220 mm/memory.c:1508
      	Call Trace:
      	 unmap_single_vma+0x1a0/0x310 mm/memory.c:1553
      	 zap_page_range_single+0x3cc/0x580 mm/memory.c:1644
      	 unmap_mapping_range_vma mm/memory.c:2792 [inline]
      	 unmap_mapping_range_tree mm/memory.c:2813 [inline]
      	 unmap_mapping_pages+0x3a7/0x5b0 mm/memory.c:2845
      	 unmap_mapping_range+0x48/0x60 mm/memory.c:2880
      	 truncate_pagecache+0x54/0x90 mm/truncate.c:800
      	 truncate_setsize+0x70/0xb0 mm/truncate.c:826
      	 simple_setattr+0xe9/0x110 fs/libfs.c:409
      	 notify_change+0xf13/0x10f0 fs/attr.c:335
      	 do_truncate+0x1ac/0x2b0 fs/open.c:63
      	 do_sys_ftruncate+0x492/0x560 fs/open.c:205
      	 __do_sys_ftruncate fs/open.c:215 [inline]
      	 __se_sys_ftruncate fs/open.c:213 [inline]
      	 __x64_sys_ftruncate+0x59/0x80 fs/open.c:213
      	 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Reproducer:
      
      	#include <stdio.h>
      	#include <stddef.h>
      	#include <stdint.h>
      	#include <stdlib.h>
      	#include <string.h>
      	#include <sys/types.h>
      	#include <sys/stat.h>
      	#include <sys/ioctl.h>
      	#include <sys/mman.h>
      	#include <unistd.h>
      	#include <fcntl.h>
      
      	#define KCOV_INIT_TRACE			_IOR('c', 1, unsigned long)
      	#define KCOV_ENABLE			_IO('c', 100)
      	#define KCOV_DISABLE			_IO('c', 101)
      	#define COVER_SIZE			(1024<<10)
      
      	#define KCOV_TRACE_PC  0
      	#define KCOV_TRACE_CMP 1
      
      	int main(int argc, char **argv)
      	{
      		int fd;
      		unsigned long *cover;
      
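      		/* Make sure debugfs is mounted so the kcov file exists. */
      		system("mount -t debugfs none /sys/kernel/debug");
      		fd = open("/sys/kernel/debug/kcov", O_RDWR);
      		ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE);
      		/* Map the kcov buffer shared once, then unmap it again. */
      		cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
      				PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      		munmap(cover, COVER_SIZE * sizeof(unsigned long));
      		/* Map it again, this time privately, and touch every page. */
      		cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
      				PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
      		memset(cover, 0, COVER_SIZE * sizeof(unsigned long));
      		/* The call trace above starts from this ftruncate(). */
      		ftruncate(fd, 3UL << 20);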
      		return 0;
      	}
      
      This can be fixed by assigning anonymous VMAs their own vm_ops and not
      relying on it being NULL.
      
      If ->mmap() failed to set ->vm_ops, mmap_region() will set it to
      dummy_vm_ops.  This way we will have non-NULL ->vm_ops for all VMAs.
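      
      The fallback to a dummy ops table can be modeled in a few lines.  This
      is a toy user-space sketch, not the kernel code: struct toy_vma,
      struct toy_vm_ops, toy_dummy_vm_ops and the helpers are hypothetical,
      and toy_vma_is_anonymous() only mimics the NULL-based check described
      above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ops table; the real one holds function pointers. */
struct toy_vm_ops { int unused; };

/* Shared placeholder used when a driver's ->mmap() sets no ops. */
static const struct toy_vm_ops toy_dummy_vm_ops = { 0 };

struct toy_vma {
	const struct toy_vm_ops *vm_ops;
};

/* NULL-based detection: any VMA without ops looks anonymous. */
static bool toy_vma_is_anonymous(const struct toy_vma *vma)
{
	return vma->vm_ops == NULL;
}

/* Run after the driver's ->mmap(): never leave a file VMA with NULL ops. */
static void toy_mmap_fixup(struct toy_vma *vma)
{
	if (!vma->vm_ops)
		vma->vm_ops = &toy_dummy_vm_ops;
}
```

      A file-backed toy_vma whose driver left vm_ops NULL is misreported as
      anonymous before toy_mmap_fixup() runs and correctly reported as
      non-anonymous afterwards.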
      
      Link: http://lkml.kernel.org/r/20180724121139.62570-4-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: syzbot+3f84280d52be9b7083cc@syzkaller.appspotmail.com
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      mm: introduce vma_init() · 027232da
      Kirill A. Shutemov committed
      Not all VMAs are allocated with vm_area_alloc().  Some of them are
      allocated on the stack or in a data segment.
      
      The new helper can be used to initialize a VMA properly regardless of
      where it was allocated.
      
      Link: http://lkml.kernel.org/r/20180724121139.62570-2-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 24 Jul, 2018 1 commit
    • D
      mm, memory_failure: Teach memory_failure() about dev_pagemap pages · 6100e34b
      Dan Williams committed
          mce: Uncorrected hardware memory error in user-access at af34214200
          {1}[Hardware Error]: It has been corrected by h/w and requires no further action
          mce: [Hardware Error]: Machine check events logged
          {1}[Hardware Error]: event severity: corrected
          Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
          [..]
          Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
          mce: Memory error not recovered
      
      In contrast to typical memory, dev_pagemap pages may be dax mapped. With
      dax there is no possibility to map in another page dynamically since dax
      establishes 1:1 physical address to file offset associations. Also
      dev_pagemap pages associated with NVDIMM / persistent memory devices can
      internally remap/repair addresses with poison. While memory_failure()
      assumes that it can discard typical poisoned pages and keep them
      unmapped indefinitely, dev_pagemap pages may be returned to service
      after the error is cleared.
      
      Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
      dev_pagemap pages that have poison consumed by userspace. Mark the
      memory as UC instead of unmapping it completely to allow ongoing access
      via the device driver (nd_pmem). Later, nd_pmem will grow support for
      marking the page back to WB when the error is cleared.
      
      Cc: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>