1. 14 4月, 2017 1 次提交
  2. 25 2月, 2017 1 次提交
  3. 11 9月, 2015 2 次提交
    • V
      mm: introduce idle page tracking · 33c3fc71
      Vladimir Davydov 提交于
      Knowing the portion of memory that is not used by a certain application or
      memory cgroup (idle memory) can be useful for partitioning the system
      efficiently, e.g.  by setting memory cgroup limits appropriately.
      Currently, the only means to estimate the amount of idle memory provided
      by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
      access bit for all pages mapped to a particular process by writing 1 to
      clear_refs, wait for some time, and then count smaps:Referenced.  However,
      this method has two serious shortcomings:
      
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      
      To overcome these drawbacks, this patch introduces two new page flags,
      Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
      A page's Idle flag can only be set from userspace by setting bit in
      /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
      and it is cleared whenever the page is accessed either through page tables
      (it is cleared in page_referenced() in this case) or using the read(2)
      system call (mark_page_accessed()). Thus by setting the Idle flag for
      pages of a particular workload, which can be found e.g.  by reading
      /proc/PID/pagemap, waiting for some time to let the workload access its
      working set, and then reading the bitmap file, one can estimate the amount
      of pages that are not used by the workload.
      
      The Young page flag is used to avoid interference with the memory
      reclaimer.  A page's Young flag is set whenever the Access bit of a page
      table entry pointing to the page is cleared by writing to the bitmap file.
      If page_referenced() is called on a Young page, it will add 1 to its
      return value, therefore concealing the fact that the Access bit was
      cleared.
      
      Note, since there is no room for extra page flags on 32 bit, this feature
      uses extended page flags when compiled on 32 bit.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: kpageidle requires an MMU]
      [akpm@linux-foundation.org: decouple from page-flags rework]
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33c3fc71
    • V
      mmu-notifier: add clear_young callback · 1d7715c6
      Vladimir Davydov 提交于
      In the scope of the idle memory tracking feature, which is introduced by
      the following patch, we need to clear the referenced/accessed bit not only
      in primary, but also in secondary ptes.  The latter is required in order
      to estimate wss of KVM VMs.  At the same time we want to avoid flushing
      tlb, because it is quite expensive and it won't really affect the final
      result.
      
      Currently, there is no function for clearing pte young bit that would meet
      our requirements, so this patch introduces one.  To achieve that we have
      to add a new mmu-notifier callback, clear_young, since there is no method
      for testing-and-clearing a secondary pte w/o flushing tlb.  The new method
      is not mandatory and currently only implemented by KVM.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d7715c6
  4. 25 6月, 2015 1 次提交
  5. 14 12月, 2014 1 次提交
  6. 24 9月, 2014 1 次提交
    • A
      kvm: Fix page ageing bugs · 57128468
      Andres Lagar-Cavilla 提交于
      1. We were calling clear_flush_young_notify in unmap_one, but we are
      within an mmu notifier invalidate range scope. The spte exists no more
      (due to range_start) and the accessed bit info has already been
      propagated (due to kvm_pfn_set_accessed). Simply call
      clear_flush_young.
      
      2. We clear_flush_young on a primary MMU PMD, but this may be mapped
      as a collection of PTEs by the secondary MMU (e.g. during log-dirty).
      This required expanding the interface of the clear_flush_young mmu
      notifier, so a lot of code has been trivially touched.
      
      3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate
      the access bit by blowing the spte. This requires proper synchronizing
      with MMU notifier consumers, like every other removal of spte's does.
      Signed-off-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      57128468
  7. 07 8月, 2014 1 次提交
    • P
      mmu_notifier: add call_srcu and sync function for listener to delay call and sync · b972216e
      Peter Zijlstra 提交于
      When kernel device drivers or subsystems want to bind their lifespan to
      t= he lifespan of the mm_struct, they usually use one of the following
      methods:
      
      1. Manually calling a function in the interested kernel module.  The
         funct= ion call needs to be placed in mmput.  This method was rejected
         by several ker= nel maintainers.
      
      2. Registering to the mmu notifier release mechanism.
      
      The problem with the latter approach is that the mmu_notifier_release
      cal= lback is called from__mmu_notifier_release (called from exit_mmap).
      That functi= on iterates over the list of mmu notifiers and don't expect
      the release call= back function to remove itself from the list.
      Therefore, the callback function= in the kernel module can't release the
      mmu_notifier_object, which is actuall= y the kernel module's object
      itself.  As a result, the destruction of the kernel module's object must
      to be done in a delayed fashion.
      
      This patch adds support for this delayed callback, by adding a new
      mmu_notifier_call_srcu function that receives a function ptr and calls
      th= at function with call_srcu.  In that function, the kernel module
      releases its object.  To use mmu_notifier_call_srcu, the calling module
      needs to call b= efore that a new function called
      mmu_notifier_unregister_no_release that as its= name implies,
      unregisters a notifier without calling its notifier release call= back.
      
      This patch also adds a function that will call barrier_srcu so those
      kern= el modules can sync with mmu_notifier.
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: NOded Gabbay <oded.gabbay@amd.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b972216e
  8. 13 11月, 2014 3 次提交
    • J
      mmu_notifier: add the callback for mmu_notifier_invalidate_range() · 0f0a327f
      Joerg Roedel 提交于
      Now that the mmu_notifier_invalidate_range() calls are in place, add the
      callback to allow subsystems to register against it.
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Jay Cornwall <Jay.Cornwall@amd.com>
      Cc: Oded Gabbay <Oded.Gabbay@amd.com>
      Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
      Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NOded Gabbay <oded.gabbay@amd.com>
      0f0a327f
    • J
      mmu_notifier: call mmu_notifier_invalidate_range() from VMM · 34ee645e
      Joerg Roedel 提交于
      Add calls to the new mmu_notifier_invalidate_range() function to all
      places in the VMM that need it.
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Jay Cornwall <Jay.Cornwall@amd.com>
      Cc: Oded Gabbay <Oded.Gabbay@amd.com>
      Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
      Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NOded Gabbay <oded.gabbay@amd.com>
      34ee645e
    • J
      mmu_notifier: add mmu_notifier_invalidate_range() · 1897bdc4
      Joerg Roedel 提交于
      This notifier closes an important gap in the current mmu_notifier
      implementation, the existing callbacks are called too early or too late to
      reliably manage a non-CPU TLB.  Specifically, invalidate_range_start() is
      called when all pages are still mapped and invalidate_range_end() when all
      pages are unmapped and potentially freed.
      
      This is fine when the users of the mmu_notifiers manage their own SoftTLB,
      like KVM does.  When the TLB is managed in software it is easy to wipe out
      entries for a given range and prevent new entries to be established until
      invalidate_range_end is called.
      
      But when the user of mmu_notifiers has to manage a hardware TLB it can
      still wipe out TLB entries in invalidate_range_start, but it can't make
      sure that no new TLB entries in the given range are established between
      invalidate_range_start and invalidate_range_end.
      
      To avoid silent data corruption the entries in the non-CPU TLB need to be
      flushed when the pages are unmapped (at this point in time no _new_ TLB
      entries can be established in the non-CPU TLB) but not yet freed (as the
      non-CPU TLB may still have _existing_ entries pointing to the pages about
      to be freed).
      
      To fix this problem we need to catch the moment when the Linux VMM flushes
      remote TLBs (as a non-CPU TLB is not very CPU TLB), as this is the point
      in time when the pages are unmapped but _not_ yet freed.
      
      The mmu_notifier_invalidate_range() function aims to catch that moment.
      
      IOMMU code will be one user of the notifier-callback.  Currently this is
      only the AMD IOMMUv2 driver, but its code is about to be more generalized
      and converted to a generic IOMMU-API extension to fit the needs of similar
      functionality in other IOMMUs as well.
      
      The current attempt in the AMD IOMMUv2 driver to work around the
      invalidate_range_start/end() shortcoming is to assign an empty page table
      to the non-CPU TLB between any invalidata_range_start/end calls.  With the
      empty page-table assigned, every page-table walk to re-fill the non-CPU
      TLB will cause a page-fault reported to the IOMMU driver via an interrupt,
      possibly causing interrupt storms.
      
      The page-fault handler in the AMD IOMMUv2 driver doesn't handle the fault
      if an invalidate_range_start/end pair is active, it just reports back
      SUCCESS to the device and let it refault the page.  But existing hardware
      (newer Radeon GPUs) that makes use of this feature don't re-fault
      indefinitly, after a certain number of faults for the same address the
      device enters a failure state and needs to be resetted.
      
      To avoid the GPUs entering a failure state we need to get rid of the
      empty-page-table workaround and use the mmu_notifier_invalidate_range()
      function introduced with this patch.
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Jay Cornwall <Jay.Cornwall@amd.com>
      Cc: Oded Gabbay <Oded.Gabbay@amd.com>
      Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
      Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NOded Gabbay <oded.gabbay@amd.com>
      1897bdc4
  9. 05 2月, 2013 1 次提交
  10. 09 10月, 2012 3 次提交
    • S
      mm: move all mmu notifier invocations to be done outside the PT lock · 2ec74c3e
      Sagi Grimberg 提交于
      In order to allow sleeping during mmu notifier calls, we need to avoid
      invoking them under the page table spinlock.  This patch solves the
      problem by calling invalidate_page notification after releasing the lock
      (but before freeing the page itself), or by wrapping the page invalidation
      with calls to invalidate_range_begin and invalidate_range_end.
      
      To prevent accidental changes to the invalidate_range_end arguments after
      the call to invalidate_range_begin, the patch introduces a convention of
      saving the arguments in consistently named locals:
      
      	unsigned long mmun_start;	/* For mmu_notifiers */
      	unsigned long mmun_end;	/* For mmu_notifiers */
      
      	...
      
      	mmun_start = ...
      	mmun_end = ...
      	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
      
      	...
      
      	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
      
      The patch changes code to use this convention for all calls to
      mmu_notifier_invalidate_range_start/end, except those where the calls are
      close enough so that anyone who glances at the code can see the values
      aren't changing.
      
      This patchset is a preliminary step towards on-demand paging design to be
      added to the RDMA stack.
      
      Why do we want on-demand paging for Infiniband?
      
        Applications register memory with an RDMA adapter using system calls,
        and subsequently post IO operations that refer to the corresponding
        virtual addresses directly to HW.  Until now, this was achieved by
        pinning the memory during the registration calls.  The goal of on demand
        paging is to avoid pinning the pages of registered memory regions (MRs).
         This will allow users the same flexibility they get when swapping any
        other part of their processes address spaces.  Instead of requiring the
        entire MR to fit in physical memory, we can allow the MR to be larger,
        and only fit the current working set in physical memory.
      
      Why should anyone care?  What problems are users currently experiencing?
      
        This can make programming with RDMA much simpler.  Today, developers
        that are working with more data than their RAM can hold need either to
        deregister and reregister memory regions throughout their process's
        life, or keep a single memory region and copy the data to it.  On demand
        paging will allow these developers to register a single MR at the
        beginning of their process's life, and let the operating system manage
        which pages needs to be fetched at a given time.  In the future, we
        might be able to provide a single memory access key for each process
        that would provide the entire process's address as one large memory
        region, and the developers wouldn't need to register memory regions at
        all.
      
      Is there any prospect that any other subsystems will utilise these
      infrastructural changes?  If so, which and how, etc?
      
        As for other subsystems, I understand that XPMEM wanted to sleep in
        MMU notifiers, as Christoph Lameter wrote at
        http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and
        perhaps Andrea knows about other use cases.
      
        Scheduling in mmu notifications is required since we need to sync the
        hardware with the secondary page tables change.  A TLB flush of an IO
        device is inherently slower than a CPU TLB flush, so our design works by
        sending the invalidation request to the device, and waiting for an
        interrupt before exiting the mmu notifier handler.
      
      Avi said:
      
        kvm may be a buyer.  kvm::mmu_lock, which serializes guest page
        faults, also protects long operations such as destroying large ranges.
        It would be good to convert it into a spinlock, but as it is used inside
        mmu notifiers, this cannot be done.
      
        (there are alternatives, such as keeping the spinlock and using a
        generation counter to do the teardown in O(1), which is what the "may"
        is doing up there).
      
      [akpm@linux-foundation.orgpossible speed tweak in hugetlb_cow(), cleanups]
      Signed-off-by: NAndrea Arcangeli <andrea@qumranet.com>
      Signed-off-by: NSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: NHaggai Eran <haggaie@mellanox.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Liran Liss <liranl@mellanox.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ec74c3e
    • S
      mm: mmu_notifier: have mmu_notifiers use a global SRCU so they may safely schedule · 21a92735
      Sagi Grimberg 提交于
      With an RCU based mmu_notifier implementation, any callout to
      mmu_notifier_invalidate_range_{start,end}() or
      mmu_notifier_invalidate_page() would not be allowed to call schedule()
      as that could potentially allow a modification to the mmu_notifier
      structure while it is currently being used.
      
      Since srcu allocs 4 machine words per instance per cpu, we may end up
      with memory exhaustion if we use srcu per mm.  So all mms share a global
      srcu.  Note that during large mmu_notifier activity exit & unregister
      paths might hang for longer periods, but it is tolerable for current
      mmu_notifier clients.
      Signed-off-by: NSagi Grimberg <sagig@mellanox.co.il>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      21a92735
    • X
      mm: mmu_notifier: fix inconsistent memory between secondary MMU and host · 48af0d7c
      Xiao Guangrong 提交于
      There is a bug in set_pte_at_notify() which always sets the pte to the
      new page before releasing the old page in the secondary MMU.  At this
      time, the process will access on the new page, but the secondary MMU
      still access on the old page, the memory is inconsistent between them
      
      The below scenario shows the bug more clearly:
      
      at the beginning: *p = 0, and p is write-protected by KSM or shared with
      parent process
      
      CPU 0                                       CPU 1
      write 1 to p to trigger COW,
      set_pte_at_notify will be called:
        *pte = new_page + W; /* The W bit of pte is set */
      
                                           *p = 1; /* pte is valid, so no #PF */
      
                                           return back to secondary MMU, then
                                           the secondary MMU read p, but get:
                                           *p == 0;
      
                               /*
                                * !!!!!!
                                * the host has already set p to 1, but the secondary
                                * MMU still get the old value 0
                                */
      
        call mmu_notifier_change_pte to release
        old page in secondary MMU
      
      We can fix it by release old page first, then set the pte to the new
      page.
      
      Note, the new page will be firstly used in secondary MMU before it is
      mapped into the page table of the process, but this is safe because it
      is protected by the page table lock, there is no race to change the pte
      
      [akpm@linux-foundation.org: add comment from Andrea]
      Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      48af0d7c
  11. 25 5月, 2011 2 次提交
  12. 14 1月, 2011 2 次提交
  13. 29 10月, 2010 1 次提交
  14. 22 9月, 2009 1 次提交
    • I
      ksm: add mmu_notifier set_pte_at_notify() · 828502d3
      Izik Eidus 提交于
      KSM is a linux driver that allows dynamicly sharing identical memory pages
      between one or more processes.
      
      Unlike tradtional page sharing that is made at the allocation of the
      memory, ksm do it dynamicly after the memory was created.  Memory is
      periodically scanned; identical pages are identified and merged.
      
      The sharing is made in a transparent way to the processes that use it.
      
      Ksm is highly important for hypervisors (kvm), where in production
      enviorments there might be many copys of the same data data among the host
      memory.  This kind of data can be: similar kernels, librarys, cache, and
      so on.
      
      Even that ksm was wrote for kvm, any userspace application that want to
      use it to share its data can try it.
      
      Ksm may be useful for any application that might have similar (page
      aligment) data strctures among the memory, ksm will find this data merge
      it to one copy, and even if it will be changed and thereforew copy on
      writed, ksm will merge it again as soon as it will be identical again.
      
      Another reason to consider using ksm is the fact that it might simplify
      alot the userspace code of application that want to use shared private
      data, instead that the application will mange shared area, ksm will do
      this for the application, and even write to this data will be allowed
      without any synchinization acts from the application.
      
      Ksm was designed to be a loadable module that doesn't change the VM code
      of linux.
      
      This patch:
      
      The set_pte_at_notify() macro allows setting a pte in the shadow page
      table directly, instead of flushing the shadow page table entry and then
      getting vmexit to set it.  It uses a new change_pte() callback to do so.
      
      set_pte_at_notify() is an optimization for kvm, and other users of
      mmu_notifiers, for COW pages.  It is useful for kvm when ksm is used,
      because it allows kvm not to have to receive vmexit and only then map the
      ksm page into the shadow page table, but instead map it directly at the
      same time as Linux maps the page into the host page table.
      
      Users of mmu_notifiers who don't implement new mmu_notifier_change_pte()
      callback will just receive the mmu_notifier_invalidate_page() callback.
      Signed-off-by: NIzik Eidus <ieidus@redhat.com>
      Signed-off-by: NChris Wright <chrisw@redhat.com>
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      828502d3
  15. 29 7月, 2008 1 次提交
    • A
      mmu-notifiers: core · cddb8a5c
      Andrea Arcangeli 提交于
      With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages.
       There are secondary MMUs (with secondary sptes and secondary tlbs) too.
      sptes in the kvm case are shadow pagetables, but when I say spte in
      mmu-notifier context, I mean "secondary pte".  In GRU case there's no
      actual secondary pte and there's only a secondary tlb because the GRU
      secondary MMU has no knowledge about sptes and every secondary tlb miss
      event in the MMU always generates a page fault that has to be resolved by
      the CPU (this is not the case of KVM where the a secondary tlb miss will
      walk sptes in hardware and it will refill the secondary tlb transparently
      to software if the corresponding spte is present).  The same way
      zap_page_range has to invalidate the pte before freeing the page, the spte
      (and secondary tlb) must also be invalidated before any page is freed and
      reused.
      
      Currently we take a page_count pin on every page mapped by sptes, but that
      means the pages can't be swapped whenever they're mapped by any spte
      because they're part of the guest working set.  Furthermore a spte unmap
      event can immediately lead to a page to be freed when the pin is released
      (so requiring the same complex and relatively slow tlb_gather smp safe
      logic we have in zap_page_range and that can be avoided completely if the
      spte unmap event doesn't require an unpin of the page previously mapped in
      the secondary MMU).
      
      The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
      when the VM is swapping or freeing or doing anything on the primary MMU so
      that the secondary MMU code can drop sptes before the pages are freed,
      avoiding all page pinning and allowing 100% reliable swapping of guest
      physical address space.  Furthermore it avoids the code that teardown the
      mappings of the secondary MMU, to implement a logic like tlb_gather in
      zap_page_range that would require many IPI to flush other cpu tlbs, for
      each fixed number of spte unmapped.
      
      To make an example: if what happens on the primary MMU is a protection
      downgrade (from writeable to wrprotect) the secondary MMU mappings will be
      invalidated, and the next secondary-mmu-page-fault will call
      get_user_pages and trigger a do_wp_page through get_user_pages if it
      called get_user_pages with write=1, and it'll re-establishing an updated
      spte or secondary-tlb-mapping on the copied page.  Or it will setup a
      readonly spte or readonly tlb mapping if it's a guest-read, if it calls
      get_user_pages with write=0.  This is just an example.
      
      This allows to map any page pointed by any pte (and in turn visible in the
      primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an
      full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
      with kvm), or a remote DMA in software like XPMEM (hence needing of
      schedule in XPMEM code to send the invalidate to the remote node, while no
      need to schedule in kvm/gru as it's an immediate event like invalidating
      primary-mmu pte).
      
      At least for KVM without this patch it's impossible to swap guests
      reliably.  And having this feature and removing the page pin allows
      several other optimizations that simplify life considerably.
      
      Dependencies:
      
      1) mm_take_all_locks() to register the mmu notifier when the whole VM
         isn't doing anything with "mm".  This allows mmu notifier users to keep
         track if the VM is in the middle of the invalidate_range_begin/end
         critical section with an atomic counter incraese in range_begin and
         decreased in range_end.  No secondary MMU page fault is allowed to map
         any spte or secondary tlb reference, while the VM is in the middle of
         range_begin/end as any page returned by get_user_pages in that critical
         section could later immediately be freed without any further
         ->invalidate_page notification (invalidate_range_begin/end works on
         ranges and ->invalidate_page isn't called immediately before freeing
         the page).  To stop all page freeing and pagetable overwrites the
         mmap_sem must be taken in write mode and all other anon_vma/i_mmap
         locks must be taken too.
      
      2) It'd be a waste to add branches in the VM if nobody could possibly
         run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if
         CONFIG_KVM=m/y.  In the current kernel kvm won't yet take advantage of
         mmu notifiers, but this already allows to compile a KVM external module
         against a kernel with mmu notifiers enabled and from the next pull from
         kvm.git we'll start using them.  And GRU/XPMEM will also be able to
         continue the development by enabling KVM=m in their config, until they
         submit all GRU/XPMEM GPLv2 code to the mainline kernel.  Then they can
         also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
         This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
         are all =n.
      
      The mmu_notifier_register call can fail because mm_take_all_locks may be
      interrupted by a signal and return -EINTR.  Because mmu_notifier_reigster
      is used when a driver startup, a failure can be gracefully handled.  Here
      an example of the change applied to kvm to register the mmu notifiers.
      Usually when a driver startups other allocations are required anyway and
      -ENOMEM failure paths exists already.
      
       struct  kvm *kvm_arch_create_vm(void)
       {
              struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
      +       int err;
      
              if (!kvm)
                      return ERR_PTR(-ENOMEM);
      
              INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
      
      +       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
      +       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
      +       if (err) {
      +               kfree(kvm);
      +               return ERR_PTR(err);
      +       }
      +
              return kvm;
       }
      
      mmu_notifier_unregister returns void and it's reliable.
      
      The patch also adds a few needed but missing includes that would prevent
      kernel to compile after these changes on non-x86 archs (x86 didn't need
      them by luck).
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
      [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
      Signed-off-by: NAndrea Arcangeli <andrea@qumranet.com>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
      Cc: Roland Dreier <rdreier@cisco.com>
      Cc: Steve Wise <swise@opengridcomputing.com>
      Cc: Avi Kivity <avi@qumranet.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Anthony Liguori <aliguori@us.ibm.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Cc: Marcelo Tosatti <marcelo@kvack.org>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: Izik Eidus <izike@qumranet.com>
      Cc: Anthony Liguori <aliguori@us.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cddb8a5c