1. 03 3月, 2017 1 次提交
  2. 08 2月, 2017 1 次提交
    • M
      s390: add no-execute support · 57d7f939
      Martin Schwidefsky 提交于
      Bit 0x100 of a page table, segment table of region table entry
      can be used to disallow code execution for the virtual addresses
      associated with the entry.
      
      There is one tricky bit, the system call to return from a signal
      is part of the signal frame written to the user stack. With a
      non-executable stack this would stop working. To avoid breaking
      things the protection fault handler checks the opcode that caused
      the fault for 0x0a77 (sys_sigreturn) and 0x0aad (sys_rt_sigreturn)
      and injects a system call. This is preferable to the alternative
      solution with a stub function in the vdso because it works for
      vdso=off and statically linked binaries as well.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      57d7f939
  3. 30 1月, 2017 1 次提交
  4. 24 1月, 2017 1 次提交
  5. 24 8月, 2016 2 次提交
  6. 06 7月, 2016 1 次提交
  7. 20 6月, 2016 3 次提交
    • D
      s390/mm: shadow pages with real guest requested protection · a9d23e71
      David Hildenbrand 提交于
      We really want to avoid manually handling protection for nested
      virtualization. By shadowing pages with the protection the guest asked us
      for, the SIE can handle most protection-related actions for us (e.g.
      special handling for MVPG) and we can directly forward protection
      exceptions to the guest.
      
      PTEs will now always be shadowed with the correct _PAGE_PROTECT flag.
      Unshadowing will take care of any guest changes to the parent PTE and
      any host changes to the host PTE. If the host PTE doesn't have the
      fitting access rights or is not available, we have to fix it up.
      Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NDavid Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      a9d23e71
    • M
      s390/mm: add shadow gmap support · 4be130a0
      Martin Schwidefsky 提交于
      For a nested KVM guest the outer KVM host needs to create shadow
      page tables for the nested guest. This patch adds the basic support
      to the guest address space (gmap) code.
      
      For each guest address space the inner KVM host creates, the first
      outer KVM host needs to create shadow page tables. The address space
      is identified by the ASCE loaded into the control register 1 at the
      time the inner SIE instruction for the second nested KVM guest is
      executed. The outer KVM host creates the shadow tables starting with
      the table identified by the ASCE on a on-demand basis. The outer KVM
      host will get repeated faults for all the shadow tables needed to
      run the second KVM guest.
      
      While a shadow page table for the second KVM guest is active the access
      to the origin region, segment and page tables needs to be restricted
      for the first KVM guest. For region and segment and page tables the first
      KVM guest may read the memory, but write attempt has to lead to an
      unshadow.  This is done using the page invalid and read-only bits in the
      page table of the first KVM guest. If the first guest re-accesses one of
      the origin pages of a shadow, it gets a fault and the affected parts of
      the shadow page table hierarchy needs to be removed again.
      
      PGSTE tables don't have to be shadowed, as all interpretation assist can't
      deal with the invalid bits in the shadow pte being set differently than
      the original ones provided by the first KVM guest.
      
      Many bug fixes and improvements by David Hildenbrand.
      Reviewed-by: NDavid Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      4be130a0
    • M
      s390/mm: extended gmap pte notifier · b2d73b2a
      Martin Schwidefsky 提交于
      The current gmap pte notifier forces a pte into to a read-write state.
      If the pte is invalidated the gmap notifier is called to inform KVM
      that the mapping will go away.
      
      Extend this approach to allow read-write, read-only and no-access
      as possible target states and call the pte notifier for any change
      to the pte.
      
      This mechanism is used to temporarily set specific access rights for
      a pte without doing the heavy work of a true mprotect call.
      Reviewed-by: NDavid Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      b2d73b2a
  8. 13 6月, 2016 3 次提交
    • M
      s390/mm: simplify the TLB flushing code · 64f31d58
      Martin Schwidefsky 提交于
      ptep_flush_lazy and pmdp_flush_lazy use mm->context.attach_count to
      decide between a lazy TLB flush vs an immediate TLB flush. The field
      contains two 16-bit counters, the number of CPUs that have the mm
      attached and can create TLB entries for it and the number of CPUs in
      the middle of a page table update.
      
      The __tlb_flush_asce, ptep_flush_direct and pmdp_flush_direct functions
      use the attach counter and a mask check with mm_cpumask(mm) to decide
      between a local flush local of the current CPU and a global flush.
      
      For all these functions the decision between lazy vs immediate and
      local vs global TLB flush can be based on CPU masks. There are two
      masks:  the mm->context.cpu_attach_mask with the CPUs that are actively
      using the mm, and the mm_cpumask(mm) with the CPUs that have used the
      mm since the last full flush. The decision between lazy vs immediate
      flush is based on the mm->context.cpu_attach_mask, to decide between
      local vs global flush the mm_cpumask(mm) is used.
      
      With this patch all checks will use the CPU masks, the old counter
      mm->context.attach_count with its two 16-bit values is turned into a
      single counter mm->context.flush_count that keeps track of the number
      of CPUs with incomplete page table updates. The sole user of this
      counter is finish_arch_post_lock_switch() which waits for the end of
      all page table updates.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      64f31d58
    • M
      s390/mm: fix vunmap vs finish_arch_post_lock_switch · a9809407
      Martin Schwidefsky 提交于
      The vunmap_pte_range() function calls ptep_get_and_clear() without any
      locking. ptep_get_and_clear() uses ptep_xchg_lazy()/ptep_flush_direct()
      for the page table update. ptep_flush_direct requires that preemption
      is disabled, but without any locking this is not the case. If the kernel
      preempts the task while the attach_counter is increased an endless loop
      in finish_arch_post_lock_switch() will occur the next time the task is
      scheduled.
      
      Add explicit preempt_disable()/preempt_enable() calls to the relevant
      functions in arch/s390/mm/pgtable.c.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      a9809407
    • C
      KVM: s390/mm: Fix CMMA reset during reboot · 1c343f7b
      Christian Borntraeger 提交于
      commit 1e133ab2 ("s390/mm: split arch/s390/mm/pgtable.c") factored
      out the page table handling code from __gmap_zap and  __s390_reset_cmma
      into ptep_zap_unused and added a simple flag that tells which one of the
      function (reset or not) is to be made. This also changed the behaviour,
      as it also zaps unused page table entries on reset.
      Turns out that this is wrong as s390_reset_cmma uses the page walker,
      which DOES NOT take the ptl lock.
      
      The most simple fix is to not do the zapping part on reset (which uses
      the walker)
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Fixes: 1e133ab2 ("s390/mm: split arch/s390/mm/pgtable.c")
      Cc: stable@vger.kernel.org # 4.6+
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      1c343f7b
  9. 10 6月, 2016 6 次提交
  10. 08 3月, 2016 3 次提交
  11. 02 3月, 2016 1 次提交
    • M
      s390/kvm: simplify set_guest_storage_key · 443a8133
      Martin Schwidefsky 提交于
      Git commit ab3f285f
      "KVM: s390/mm: try a cow on read only pages for key ops"
      added a fixup_user_fault to set_guest_storage_key force a copy on
      write if the page is mapped read-only. This is supposed to fix the
      problem of differing storage keys for shared mappings, e.g. the
      empty_zero_page.
      But if the storage key is set before the pte is mapped the storage
      key update is done on the pgste. A later fault will happily map the
      shared page with the key from the pgste.
      
      Eventually git commit 2faee8ff
      "s390/mm: prevent and break zero page mappings in case of storage keys"
      fixed this problem for the empty_zero_page. The commit makes sure that
      guests enabled for storage keys will not use the empty_zero_page at all.
      
      As the call to fixup_user_fault in set_guest_storage_key depends on the
      order of the storage key operation vs. the fault that maps the pte
      it does not really fix anything. Just remove it.
      Reviewed-by: NDominik Dingel <dingel@linux.vnet.ibm.com>
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      443a8133
  12. 19 1月, 2016 1 次提交
  13. 16 1月, 2016 3 次提交
    • D
      s390/mm: enable fixup_user_fault retrying · fef8953a
      Dominik Dingel 提交于
      By passing a non-null flag we allow fixup_user_fault to retry, which
      enables userfaultfd.  As during these retries we might drop the mmap_sem
      we need to check if that happened and redo the complete chain of
      actions.
      Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: "Jason J. Herne" <jjherne@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fef8953a
    • D
      mm: bring in additional flag for fixup_user_fault to signal unlock · 4a9e1cda
      Dominik Dingel 提交于
      During Jason's work with postcopy migration support for s390 a problem
      regarding gmap faults was discovered.
      
      The gmap code will call fixup_user_fault which will end up always in
      handle_mm_fault.  Till now we never cared about retries, but as the
      userfaultfd code kind of relies on it.  this needs some fix.
      
      This patchset does not take care of the futex code.  I will now look
      closer at this.
      
      This patch (of 2):
      
      With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault
      to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the
      faulting we ever unlocked mmap_sem.
      
      This patch brings in the logic to handle retries as well as it cleans up
      the current documentation.  fixup_user_fault was not having the same
      semantics as filemap_fault.  It never indicated if a retry happened and
      so a caller wasn't able to handle that case.  So we now changed the
      behaviour to always retry a locked mmap_sem.
      Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: "Jason J. Herne" <jjherne@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4a9e1cda
    • K
      s390, thp: remove infrastructure for handling splitting PMDs · fecffad2
      Kirill A. Shutemov 提交于
      With new refcounting we don't need to mark PMDs splitting.  Let's drop
      code to handle this.
      
      pmdp_splitting_flush() is not needed too: on splitting PMD we will do
      pmdp_clear_flush() + set_pte_at().  pmdp_clear_flush() will do IPI as
      needed for fast_gup.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fecffad2
  14. 15 1月, 2016 1 次提交
  15. 16 12月, 2015 1 次提交
  16. 19 8月, 2015 1 次提交
    • M
      s390/mm: simplify page table alloc/free code · 78fb9076
      Martin Schwidefsky 提交于
      With the removal of the dynamic reallocation of page tables for
      KVM (see git commit 0b46e0a3)
      the page table allocation / freeing code can be simplified.
      
      The page table free code can now use the alloc_pgste bit in the
      mm context to decide if a page table is 2K or 4K, there is no mix
      of different sized page tables anymore. This eliminates the need
      to use "page->_mapcount == 0" to check for 4K page table.
      
      Use the lower two bits in page->_mapcount to indicate which
      2K fragments of the 4K page are in use.
      
      As 31-bit support is gone, remove the two defines ALLOC_ORDER
      and FRAG_MASK and use the constants directly where appropriate.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      78fb9076
  17. 18 7月, 2015 2 次提交
  18. 26 6月, 2015 2 次提交
  19. 23 4月, 2015 1 次提交
  20. 25 3月, 2015 1 次提交
    • H
      s390: remove 31 bit support · 5a79859a
      Heiko Carstens 提交于
      Remove the 31 bit support in order to reduce maintenance cost and
      effectively remove dead code. Since a couple of years there is no
      distribution left that comes with a 31 bit kernel.
      
      The 31 bit kernel also has been broken since more than a year before
      anybody noticed. In addition I added a removal warning to the kernel
      shown at ipl for 5 minutes: a960062e ("s390: add 31 bit warning
      message") which let everybody know about the plan to remove 31 bit
      code. We didn't get any response.
      
      Given that the last 31 bit only machine was introduced in 1999 let's
      remove the code.
      Anybody with 31 bit user space code can still use the compat mode.
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      5a79859a
  21. 08 1月, 2015 2 次提交
  22. 28 11月, 2014 1 次提交
  23. 03 11月, 2014 1 次提交