1. 22 4月, 2014 1 次提交
  2. 03 4月, 2014 3 次提交
    • H
      s390/uaccess: rework uaccess code - fix locking issues · 457f2180
      Heiko Carstens 提交于
      The current uaccess code uses a page table walk in some circumstances,
      e.g. in case of the in atomic futex operations or if running on old
      hardware which doesn't support the mvcos instruction.
      
      However it turned out that the page table walk code does not correctly
      lock page tables when accessing page table entries.
      In other words: a different cpu may invalidate a page table entry while
      the current cpu inspects the pte. This may lead to random data corruption.
      
      Adding correct locking however isn't trivial for all uaccess operations.
      Especially copy_in_user() is problematic since that requires to hold at
      least two locks, but must be protected against ABBA deadlock when a
      different cpu also performs a copy_in_user() operation.
      
      So the solution is a different approach where we change address spaces:
      
      User space runs in primary address mode, or access register mode within
      vdso code, like it currently already does.
      
      The kernel usually also runs in home space mode, however when accessing
      user space the kernel switches to primary or secondary address mode if
      the mvcos instruction is not available or if a compare-and-swap (futex)
      instruction on a user space address is performed.
      KVM however is special, since that requires the kernel to run in home
      address space while implicitly accessing user space with the sie
      instruction.
      
      So we end up with:
      
      User space:
      - runs in primary or access register mode
      - cr1 contains the user asce
      - cr7 contains the user asce
      - cr13 contains the kernel asce
      
      Kernel space:
      - runs in home space mode
      - cr1 contains the user or kernel asce
        -> the kernel asce is loaded when a uaccess requires primary or
           secondary address mode
      - cr7 contains the user or kernel asce, (changed with set_fs())
      - cr13 contains the kernel asce
      
      In case of uaccess the kernel changes to:
      - primary space mode in case of a uaccess (copy_to_user) and uses
        e.g. the mvcp instruction to access user space. However the kernel
        will stay in home space mode if the mvcos instruction is available
      - secondary space mode in case of futex atomic operations, so that the
        instructions come from primary address space and data from secondary
        space
      
      In case of kvm the kernel runs in home space mode, but cr1 gets switched
      to contain the gmap asce before the sie instruction gets executed. When
      the sie instruction is finished cr1 will be switched back to contain the
      user asce.
      
      A context switch between two processes will always load the kernel asce
      for the next process in cr1. So the first exit to user space is a bit
      more expensive (one extra load control register instruction) than before,
      however keeps the code rather simple.
      
      In sum this means there is no need to perform any error prone page table
      walks anymore when accessing user space.
      
      The patch seems to be rather large, however it mainly removes the
      the page table walk code and restores the previously deleted "standard"
      uaccess code, with a couple of changes.
      
      The uaccess without mvcos mode can be enforced with the "uaccess_primary"
      kernel parameter.
      Reported-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      457f2180
    • M
      s390/mm,tlb: optimize TLB flushing for zEC12 · 1b948d6c
      Martin Schwidefsky 提交于
      The zEC12 machines introduced the local-clearing control for the IDTE
      and IPTE instruction. If the control is set only the TLB of the local
      CPU is cleared of entries, either all entries of a single address space
      for IDTE, or the entry for a single page-table entry for IPTE.
      Without the local-clearing control the TLB flush is broadcasted to all
      CPUs in the configuration, which is expensive.
      
      The reset of the bit mask of the CPUs that need flushing after a
      non-local IDTE is tricky. As TLB entries for an address space remain
      in the TLB even if the address space is detached a new bit field is
      required to keep track of attached CPUs vs. CPUs in the need of a
      flush. After a non-local flush with IDTE the bit-field of attached CPUs
      is copied to the bit-field of CPUs in need of a flush. The ordering
      of operations on cpu_attach_mask, attach_count and mm_cpumask(mm) is
      such that an underindication in mm_cpumask(mm) is prevented but an
      overindication in mm_cpumask(mm) is possible.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      1b948d6c
    • M
      s390/mm,tlb: safeguard against speculative TLB creation · 02a8f3ab
      Martin Schwidefsky 提交于
      The principles of operations states that the CPU is allowed to create
      TLB entries for an address space anytime while an ASCE is loaded to
      the control register. This is true even if the CPU is running in the
      kernel and the user address space is not (actively) accessed.
      
      In theory this can affect two aspects of the TLB flush logic.
      For full-mm flushes the ASCE of the dying process is still attached.
      The approach to flush first with IDTE and then just free all page
      tables can in theory lead to stale TLB entries. Use the batched
      free of page tables for the full-mm flushes as well.
      
      For operations that can have a stale ASCE in the control register,
      e.g. a delayed update_user_asce in switch_mm, load the kernel ASCE
      to prevent invalid TLBs from being created.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      02a8f3ab
  3. 21 2月, 2014 1 次提交
    • M
      s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries · 53e857f3
      Martin Schwidefsky 提交于
      Git commit 050eef36 "[S390] fix tlb flushing vs. concurrent
      /proc accesses" introduced the attach counter to avoid using the
      mm_users value to decide between IPTE for every PTE and lazy TLB
      flushing with IDTE. That fixed the problem with mm_users but it
      introduced another subtle race, fortunately one that is very hard
      to hit.
      The background is the requirement of the architecture that a valid
      PTE may not be changed while it can be used concurrently by another
      cpu. The decision between IPTE and lazy TLB flushing needs to be
      done while the PTE is still valid. Now if the virtual cpu is
      temporarily stopped after the decision to use lazy TLB flushing but
      before the invalid bit of the PTE has been set, another cpu can attach
      the mm, find that flush_mm is set, do the IDTE, return to userspace,
      and recreate a TLB that uses the PTE in question. When the first,
      stopped cpu continues it will change the PTE while it is attached on
      another cpu. The first cpu will do another IDTE shortly after the
      modification of the PTE which makes the race window quite short.
      
      To fix this race the CPU that wants to attach the address space of a
      user space thread needs to wait for the end of the PTE modification.
      The number of concurrent TLB flushers for an mm is tracked in the
      upper 16 bits of the attach_count and finish_arch_post_lock_switch
      is used to wait for the end of the flush operation if required.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      53e857f3
  4. 24 10月, 2013 1 次提交
  5. 22 8月, 2013 1 次提交
  6. 29 7月, 2013 1 次提交
  7. 26 9月, 2012 1 次提交
  8. 30 7月, 2012 1 次提交
  9. 26 7月, 2012 1 次提交
  10. 20 7月, 2012 1 次提交
    • H
      s390/comments: unify copyright messages and remove file names · a53c8fab
      Heiko Carstens 提交于
      Remove the file name from the comment at top of many files. In most
      cases the file name was wrong anyway, so it's rather pointless.
      
      Also unify the IBM copyright statement. We did have a lot of sightly
      different statements and wanted to change them one after another
      whenever a file gets touched. However that never happened. Instead
      people start to take the old/"wrong" statements to use as a template
      for new files.
      So unify all of them in one go.
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      a53c8fab
  11. 24 5月, 2012 1 次提交
  12. 29 3月, 2012 1 次提交
  13. 23 5月, 2011 1 次提交
    • M
      [S390] Remove data execution protection · 043d0708
      Martin Schwidefsky 提交于
      The noexec support on s390 does not rely on a bit in the page table
      entry but utilizes the secondary space mode to distinguish between
      memory accesses for instructions vs. data. The noexec code relies
      on the assumption that the cpu will always use the secondary space
      page table for data accesses while it is running in the secondary
      space mode. Up to the z9-109 class machines this has been the case.
      Unfortunately this is not true anymore with z10 and later machines.
      The load-relative-long instructions lrl, lgrl and lgfrl access the
      memory operand using the same addressing-space mode that has been
      used to fetch the instruction.
      This breaks the noexec mode for all user space binaries compiled
      with march=z10 or later. The only option is to remove the current
      noexec support.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      043d0708
  14. 10 5月, 2011 1 次提交
  15. 24 8月, 2010 1 次提交
    • M
      [S390] fix tlb flushing vs. concurrent /proc accesses · 050eef36
      Martin Schwidefsky 提交于
      The tlb flushing code uses the mm_users field of the mm_struct to
      decide if each page table entry needs to be flushed individually with
      IPTE or if a global flush for the mm_struct is sufficient after all page
      table updates have been done. The comment for mm_users says "How many
      users with user space?" but the /proc code increases mm_users after it
      found the process structure by pid without creating a new user process.
      Which makes mm_users useless for the decision between the two tlb
      flusing methods. The current code can be confused to not flush tlb
      entries by a concurrent access to /proc files if e.g. a fork is in
      progres. The solution for this problem is to make the tlb flushing
      logic independent from the mm_users field.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      050eef36
  16. 07 12月, 2009 1 次提交
    • M
      [S390] Improve address space mode selection. · b11b5334
      Martin Schwidefsky 提交于
      Introduce user_mode to replace the two variables switch_amode and
      s390_noexec. There are three valid combinations of the old values:
        1) switch_amode == 0 && s390_noexec == 0
        2) switch_amode == 1 && s390_noexec == 0
        3) switch_amode == 1 && s390_noexec == 1
      They get replaced by
        1) user_mode == HOME_SPACE_MODE
        2) user_mode == PRIMARY_SPACE_MODE
        3) user_mode == SECONDARY_SPACE_MODE
      The new kernel parameter user_mode=[primary,secondary,home] lets
      you choose the address space mode the user space processes should
      use. In addition the CONFIG_S390_SWITCH_AMODE config option
      is removed.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      b11b5334
  17. 26 3月, 2009 1 次提交
  18. 28 10月, 2008 1 次提交
    • C
      [S390] pgtables: Fix race in enable_sie vs. page table ops · 250cf776
      Christian Borntraeger 提交于
      The current enable_sie code sets the mm->context.pgstes bit to tell
      dup_mm that the new mm should have extended page tables. This bit is also
      used by the s390 specific page table primitives to decide about the page
      table layout - which means context.pgstes has two meanings. This can cause
      any kind of bugs. For example  - e.g. shrink_zone can call
      ptep_clear_flush_young while enable_sie is running. ptep_clear_flush_young
      will test for context.pgstes. Since enable_sie changed that value of the old
      struct mm without changing the page table layout ptep_clear_flush_young will
      do the wrong thing.
      The solution is to split pgstes into two bits
      - one for the allocation
      - one for the current state
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      250cf776
  19. 02 8月, 2008 1 次提交
  20. 27 4月, 2008 1 次提交
    • C
      s390: KVM preparation: provide hook to enable pgstes in user pagetable · 402b0862
      Carsten Otte 提交于
      The SIE instruction on s390 uses the 2nd half of the page table page to
      virtualize the storage keys of a guest. This patch offers the s390_enable_sie
      function, which reorganizes the page tables of a single-threaded process to
      reserve space in the page table:
      s390_enable_sie makes sure that the process is single threaded and then uses
      dup_mm to create a new mm with reorganized page tables. The old mm is freed
      and the process has now a page status extended field after every page table.
      
      Code that wants to exploit pgstes should SELECT CONFIG_PGSTE.
      
      This patch has a small common code hit, namely making dup_mm non-static.
      
      Edit (Carsten): I've modified Martin's patch, following Jeremy Fitzhardinge's
      review feedback. Now we do have the prototype for dup_mm in
      include/linux/sched.h. Following Martin's suggestion, s390_enable_sie() does now
      call task_lock() to prevent race against ptrace modification of mm_users.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NCarsten Otte <cotte@de.ibm.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAvi Kivity <avi@qumranet.com>
      402b0862
  21. 10 2月, 2008 3 次提交
  22. 26 1月, 2008 1 次提交
    • M
      [S390] Fix tlb flushing with idte. · 6f457e1a
      Martin Schwidefsky 提交于
      The clear-by-asce operation of the idte instruction gets an asce
      (address-space-control-element) as argument to specify which TLBs
      need to get flushed. The current code passes a plain pointer to
      the start of the pgd without the additional bits which would make
      the pointer an asce. The current machines don't mind the difference
      but a future model might want to use the designation type control
      bits in the asce as a filter for the TLBs to flush.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      6f457e1a
  23. 22 10月, 2007 1 次提交
  24. 03 5月, 2007 1 次提交
    • J
      [PATCH] x86: PARAVIRT: add hooks to intercept mm creation and destruction · d6dd61c8
      Jeremy Fitzhardinge 提交于
      Add hooks to allow a paravirt implementation to track the lifetime of
      an mm.  Paravirtualization requires three hooks, but only two are
      needed in common code.  They are:
      
      arch_dup_mmap, which is called when a new mmap is created at fork
      
      arch_exit_mmap, which is called when the last process reference to an
        mm is dropped, which typically happens on exit and exec.
      
      The third hook is activate_mm, which is called from the arch-specific
      activate_mm() macro/function, and so doesn't need stub versions for
      other architectures.  It's called when an mm is first used.
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Cc: linux-arch@vger.kernel.org
      Cc: James Bottomley <James.Bottomley@SteelEye.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      d6dd61c8
  25. 06 2月, 2007 1 次提交
    • G
      [S390] noexec protection · c1821c2e
      Gerald Schaefer 提交于
      This provides a noexec protection on s390 hardware. Our hardware does
      not have any bits left in the pte for a hw noexec bit, so this is a
      different approach using shadow page tables and a special addressing
      mode that allows separate address spaces for code and data.
      
      As a special feature of our "secondary-space" addressing mode, separate
      page tables can be specified for the translation of data addresses
      (storage operands) and instruction addresses. The shadow page table is
      used for the instruction addresses and the standard page table for the
      data addresses.
      The shadow page table is linked to the standard page table by a pointer
      in page->lru.next of the struct page corresponding to the page that
      contains the standard page table (since page->private is not really
      private with the pte_lock and the page table pages are not in the LRU
      list).
      Depending on the software bits of a pte, it is either inserted into
      both page tables or just into the standard (data) page table. Pages of
      a vma that does not have the VM_EXEC bit set get mapped only in the
      data address space. Any try to execute code on such a page will cause a
      page translation exception. The standard reaction to this is a SIGSEGV
      with two exceptions: the two system call opcodes 0x0a77 (sys_sigreturn)
      and 0x0aad (sys_rt_sigreturn) are allowed. They are stored by the
      kernel to the signal stack frame. Unfortunately, the signal return
      mechanism cannot be modified to use an SA_RESTORER because the
      exception unwinding code depends on the system call opcode stored
      behind the signal stack frame.
      
      This feature requires that user space is executed in secondary-space
      mode and the kernel in home-space mode, which means that the addressing
      modes need to be switched and that the noexec protection only works
      for user space.
      After switching the addressing modes, we cannot use the mvcp/mvcs
      instructions anymore to copy between kernel and user space. A new
      mvcos instruction has been added to the z9 EC/BC hardware which allows
      to copy between arbitrary address spaces, but on older hardware the
      page tables need to be walked manually.
      Signed-off-by: NGerald Schaefer <geraldsc@de.ibm.com>
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      c1821c2e
  26. 09 11月, 2005 1 次提交
  27. 17 4月, 2005 1 次提交
    • L
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds 提交于
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4