1. 21 Nov, 2019 1 commit
    • s390/vdso: avoid 64-bit vdso mapping for compat tasks · 84bfa034
      Vasily Gorbik committed
      [ Upstream commit d1befa65823e9c6d013883b8a41d081ec338c489 ]
      
      vdso_fault used is_compat_task function (on s390 it tests "current"
      thread_info flags) to distinguish compat tasks and map 31-bit vdso
      pages. But "current" task might not correspond to mm context.
      
      When 31-bit compat inferior is executed under gdb, gdb does
      PTRACE_PEEKTEXT on vdso page, causing vdso_fault with "current" being
      64-bit gdb process. So, 31-bit inferior ends up with 64-bit vdso mapped.
      
      To avoid this problem a new compat_mm flag has been introduced into
      mm context. This flag is used in vdso_fault and vdso_mremap instead
      of is_compat_task.
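
      The difference between the two checks can be modelled in a few lines of
      userspace C. This is a minimal sketch under stated assumptions, not the
      kernel code: struct task, vdso_for_current() and vdso_for_mm() are
      hypothetical names, and only the compat_mm flag is taken from the text
      above.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct mm_context { bool compat_mm; };
struct mm_struct  { struct mm_context context; };
struct task       { bool compat; struct mm_struct *mm; };

enum vdso_kind { VDSO_31BIT, VDSO_64BIT };

/* Old behaviour: decide from the task that triggered the fault. */
static enum vdso_kind vdso_for_current(const struct task *current_task)
{
	return current_task->compat ? VDSO_31BIT : VDSO_64BIT;
}

/* New behaviour: decide from the mm that owns the faulting mapping. */
static enum vdso_kind vdso_for_mm(const struct mm_struct *mm)
{
	return mm->context.compat_mm ? VDSO_31BIT : VDSO_64BIT;
}
```

      With a 64-bit debugger as the faulting task and a 31-bit inferior mm,
      only the second function picks the 31-bit vdso.
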
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  2. 31 Jan, 2019 1 commit
  3. 27 Nov, 2018 1 commit
    • s390/mm: fix mis-accounting of pgtable_bytes · 4136161d
      Martin Schwidefsky committed
      [ Upstream commit e12e4044aede97974f2222eb7f0ed726a5179a32 ]
      
      In case a fork or a clone system call fails in copy_process and the error
      handling does the mmput() at the bad_fork_cleanup_mm label, the
      following warning messages will appear on the console:
      
        BUG: non-zero pgtables_bytes on freeing mm: 16384
      
      The reason for that is the tricks we play with mm_inc_nr_puds() and
      mm_inc_nr_pmds() in init_new_context().
      
      A normal 64-bit process has 3 levels of page table, the p4d level and
      the pud level are folded. On process termination the free_pud_range()
      function in mm/memory.c will subtract 16KB from pgtable_bytes with a
      mm_dec_nr_puds() call, but there is not actually a pud table.
      
      One issue with this is the fact that pgtable_bytes is usually off
      by a few kilobytes, but the more severe problem is that for a failed
      fork or clone the free_pgtables() function is not called. In this case
      there is no mm_dec_nr_puds() or mm_dec_nr_pmds() that go together with
      the mm_inc_nr_puds() and mm_inc_nr_pmds() in init_new_context().
      The pgtable_bytes will be off by 16384 or 32768 bytes and we get the
      BUG message. The message itself is purely cosmetic, but annoying.
      
      To fix this, override the mm_pmd_folded, mm_pud_folded and mm_p4d_folded
      functions to check for the true size of the address space.
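
      A folded-level check based on the size of the address space could look
      roughly like the sketch below. It is a userspace model, not the patch
      itself: the REGION*_SIZE constants (2 GB, 4 TB and 8 PB) and the reduced
      struct layout are assumptions made for illustration.

```c
#include <stdbool.h>

/* Assumed address-space limits for 2, 3 and 4 page table levels. */
#define REGION3_SIZE (1ULL << 31)	/* 2 GB */
#define REGION2_SIZE (1ULL << 42)	/* 4 TB */
#define REGION1_SIZE (1ULL << 53)	/* 8 PB */

struct mm_context { unsigned long long asce_limit; };
struct mm_struct  { struct mm_context context; };

/* A level is folded when the address space is too small to need it. */
static bool mm_p4d_folded(const struct mm_struct *mm)
{
	return mm->context.asce_limit <= REGION1_SIZE;
}

static bool mm_pud_folded(const struct mm_struct *mm)
{
	return mm->context.asce_limit <= REGION2_SIZE;
}

static bool mm_pmd_folded(const struct mm_struct *mm)
{
	return mm->context.asce_limit <= REGION3_SIZE;
}
```

      Because the answer depends only on asce_limit, no compensating
      increment at context creation is needed to keep the accounting balanced.
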
      Reported-by: Li Wang <liwang@redhat.com>
      Tested-by: Li Wang <liwang@redhat.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  4. 31 Jul, 2018 1 commit
  5. 17 May, 2018 1 commit
  6. 09 Mar, 2018 1 commit
  7. 02 Mar, 2018 1 commit
    • s390: Fix runtime warning about negative pgtables_bytes · 61e18270
      Guenter Roeck committed
      When running s390 images with 'compat' processes, the following
      BUG is seen repeatedly.
      
      BUG: non-zero pgtables_bytes on freeing mm: -16384
      
      Bisect points to commit b4e98d9a ("mm: account pud page tables").
      Analysis shows that init_new_context() is called with
      mm->context.asce_limit set to _REGION3_SIZE. In this situation,
      pgtables_bytes remains set to 0 and is not increased. The message is
      displayed when the affected process dies and mm_dec_nr_puds() is called.
      
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Fixes: b4e98d9a ("mm: account pud page tables")
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  8. 24 Nov, 2017 1 commit
    • s390: fix alloc_pgste check in init_new_context again · 53c4ab70
      Martin Schwidefsky committed
      git commit badb8bb9 "fix alloc_pgste check in init_new_context" fixed
      the problem of 'current->mm == NULL' in init_new_context back in 2011.
      
      git commit 3eabaee9 "KVM: s390: allow sie enablement for multi-
      threaded programs" completely removed the check against alloc_pgste.
      
      git commit 23fefe11 "s390/kvm: avoid global config of vm.alloc_pgste=1"
      re-added a check against the alloc_pgste flag but without the required
      check for current->mm != NULL.
      
      For execve() called by a kernel thread init_new_context() reads from
      ((struct mm_struct *) NULL)->context.alloc_pgste to decide between
      2K vs 4K page tables. If the bit happens to be set for the init process
      it will be created with large page tables. This decision is inherited by
      all the children of init, which wastes quite some memory.
      
      Re-add the check for 'current->mm != NULL'.
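
      The guarded inheritance can be sketched as follows. This is a reduced
      userspace model under stated assumptions: inherit_alloc_pgste() is a
      hypothetical helper, the global sysctl is passed in as a plain int, and
      a NULL current_mm stands for a kernel thread calling execve().

```c
#include <stddef.h>

struct mm_context { unsigned int alloc_pgste : 1; };
struct mm_struct  { struct mm_context context; };

/*
 * Decide whether a new context gets 4K page tables (with PGSTEs).
 * current_mm may be NULL, so it must be tested before it is dereferenced.
 */
static int inherit_alloc_pgste(const struct mm_struct *current_mm,
			       int sysctl_alloc_pgste)
{
	return sysctl_alloc_pgste ||
	       (current_mm != NULL && current_mm->context.alloc_pgste);
}
```
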
      
      Fixes: 23fefe11 ("s390/kvm: avoid global config of vm.alloc_pgste=1")
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  9. 16 Nov, 2017 1 commit
  10. 14 Nov, 2017 1 commit
    • s390: remove all code using the access register mode · 0aaba41b
      Martin Schwidefsky committed
      The vdso code for the getcpu() and the clock_gettime() call use the access
      register mode to access the per-CPU vdso data page with the current code.
      
      An alternative to the complicated AR mode is to use the secondary space
      mode. This makes the vdso faster and quite a bit simpler. The downside is
      that the uaccess code has to be changed quite a bit.
      
      Which instructions are used depends on the machine and what kind of uaccess
      operation is requested. The instruction dictates which ASCE value needs
      to be loaded into %cr1 and %cr7.
      
      The different cases:
      
      * User copy with MVCOS for z10 and newer machines
        The MVCOS instruction can copy between the primary space (aka user) and
        the home space (aka kernel) directly. For set_fs(KERNEL_DS) the kernel
        ASCE is loaded into %cr1. For set_fs(USER_DS) the user space is already
        loaded in %cr1.
      
      * User copy with MVCP/MVCS for older machines
        To be able to execute the MVCP/MVCS instructions the kernel needs to
        switch to primary mode. The control register %cr1 has to be set to the
        kernel ASCE and %cr7 to either the kernel ASCE or the user ASCE dependent
        on set_fs(KERNEL_DS) vs set_fs(USER_DS).
      
      * Data access in the user address space for strnlen / futex
        To use "normal" instruction with data from the user address space the
        secondary space mode is used. The kernel needs to switch to primary mode,
        %cr1 has to contain the kernel ASCE and %cr7 either the user ASCE or the
        kernel ASCE, dependent on set_fs.
      
      Loading a new value into %cr1 or %cr7 is an expensive operation, so the
      kernel tries to be lazy about it. E.g. for multiple user copies in a row with
      MVCP/MVCS the replacement of the vdso ASCE in %cr7 with the user ASCE is
      done only once. On return to user space a CPU bit is checked that loads the
      vdso ASCE again.
      
      To enable and disable the data access via the secondary space two new
      functions are added, enable_sacf_uaccess and disable_sacf_uaccess. The fact
      that a context is in secondary space uaccess mode is stored in the
      mm_segment_t value for the task. The code of an interrupt may use set_fs
      as long as it returns to the previous state it got with get_fs with another
      call to set_fs. The code in finish_arch_post_lock_switch simply has to do a
      set_fs with the current mm_segment_t value for the task.
      
      For CPUs with MVCOS:
      
      CPU running in                        | %cr1 ASCE | %cr7 ASCE |
      --------------------------------------|-----------|-----------|
      user space                            |  user     |  vdso     |
      kernel, USER_DS, normal-mode          |  user     |  vdso     |
      kernel, USER_DS, normal-mode, lazy    |  user     |  user     |
      kernel, USER_DS, sacf-mode            |  kernel   |  user     |
      kernel, KERNEL_DS, normal-mode        |  kernel   |  vdso     |
      kernel, KERNEL_DS, normal-mode, lazy  |  kernel   |  kernel   |
      kernel, KERNEL_DS, sacf-mode          |  kernel   |  kernel   |
      
      For CPUs without MVCOS:
      
      CPU running in                        | %cr1 ASCE | %cr7 ASCE |
      --------------------------------------|-----------|-----------|
      user space                            |  user     |  vdso     |
      kernel, USER_DS, normal-mode          |  user     |  vdso     |
      kernel, USER_DS, normal-mode, lazy    |  kernel   |  user     |
      kernel, USER_DS, sacf-mode            |  kernel   |  user     |
      kernel, KERNEL_DS, normal-mode        |  kernel   |  vdso     |
      kernel, KERNEL_DS, normal-mode, lazy  |  kernel   |  kernel   |
      kernel, KERNEL_DS, sacf-mode          |  kernel   |  kernel   |
      
      The lines with "lazy" refer to the state after a copy via the secondary
      space with a delayed reload of %cr1 and %cr7.
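
      Both tables can be restated as two small lookup functions. The sketch
      below is only a compact transcription of the rows above for
      illustration; struct cpu_state, cr1_asce() and cr7_asce() are
      hypothetical names and do not appear in the patch.

```c
#include <stdbool.h>

enum asce { ASCE_USER, ASCE_KERNEL, ASCE_VDSO };

struct cpu_state {
	bool in_kernel;	/* false: CPU is running in user space */
	bool kernel_ds;	/* set_fs(KERNEL_DS) rather than set_fs(USER_DS) */
	bool sacf;	/* secondary-space uaccess mode is enabled */
	bool lazy;	/* reload of %cr1/%cr7 is still pending */
};

/* ASCE expected in %cr1 for the given state. */
static enum asce cr1_asce(struct cpu_state s, bool has_mvcos)
{
	if (!s.in_kernel)
		return ASCE_USER;
	if (s.kernel_ds || s.sacf)
		return ASCE_KERNEL;
	/* USER_DS, normal mode: the only row where the two tables differ. */
	return (s.lazy && !has_mvcos) ? ASCE_KERNEL : ASCE_USER;
}

/* ASCE expected in %cr7; identical with and without MVCOS. */
static enum asce cr7_asce(struct cpu_state s)
{
	if (!s.in_kernel || (!s.sacf && !s.lazy))
		return ASCE_VDSO;
	return s.kernel_ds ? ASCE_KERNEL : ASCE_USER;
}
```
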
      
      There are three hardware address spaces that can cause a DAT exception,
      primary, secondary and home space. The exception can be related to
      four different fault types: user space fault, vdso fault, kernel fault,
      and the gmap faults.
      
      Dependent on the set_fs state and normal vs. sacf mode there are a number
      of fault combinations:
      
      1) user address space fault via the primary ASCE
      2) gmap address space fault via the primary ASCE
      3) kernel address space fault via the primary ASCE for machines with
         MVCOS and set_fs(KERNEL_DS)
      4) vdso address space faults via the secondary ASCE with an invalid
         address while running in secondary space in problem state
      5) user address space fault via the secondary ASCE for user-copy
         based on the secondary space mode, e.g. futex_ops or strnlen_user
      6) kernel address space fault via the secondary ASCE for user-copy
         with secondary space mode with set_fs(KERNEL_DS)
      7) kernel address space fault via the primary ASCE for user-copy
         with secondary space mode with set_fs(USER_DS) on machines without
         MVCOS.
      8) kernel address space fault via the home space ASCE
      
      Replace user_space_fault() with a new function get_fault_type() that
      can distinguish all four different fault types.
      
      With these changes the futex atomic ops from the kernel and the
      strnlen_user will get a little bit slower, as well as the old style
      uaccess with MVCP/MVCS. All user accesses based on MVCOS will be as
      fast as before. On the positive side, the user space vdso code is a
      lot faster and Linux ceases to use the complicated AR mode.
      Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
  11. 02 Nov, 2017 1 commit
    • License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman committed
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it,
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information.
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side-by-side results from the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging were:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In the initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  12. 06 Sep, 2017 3 commits
  13. 31 Aug, 2017 1 commit
  14. 29 Aug, 2017 1 commit
  15. 26 Jul, 2017 1 commit
  16. 13 Jun, 2017 1 commit
    • s390/kvm: avoid global config of vm.alloc_pgste=1 · 23fefe11
      Martin Schwidefsky committed
      The system control vm.alloc_pgste is used to control the size of the
      page tables, either 2K or 4K. The idea is that a KVM host sets the
      vm.alloc_pgste control to 1 which causes *all* new processes to run
      with 4K page tables. For a non-kvm system the control should stay off
      to save on memory used for page tables.
      
      Trouble is that distributions choose to set the control globally to
      be able to run KVM guests. This wastes memory on non-KVM systems.
      
      Introduce the PT_S390_PGSTE ELF segment type to "mark" the qemu
      executable with it. All executables with this (empty) segment in
      its ELF phdr array will be started with 4K page tables. Any executable
      without PT_S390_PGSTE will run with the default 2K page tables.
      
      This removes the need to set vm.alloc_pgste=1 for a KVM host and
      minimizes the waste of memory for page tables.
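
      The marker check amounts to a scan of the program header array. The
      following is a minimal sketch, not the kernel's ELF loader hook: the
      reduced struct phdr and the helper name wants_4k_page_tables() are
      hypothetical, and the numeric value used for PT_S390_PGSTE is an
      assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value of the s390-specific segment type. */
#define PT_S390_PGSTE 0x70000000u

/* Hypothetical, reduced program header: only the type field matters here. */
struct phdr { uint32_t p_type; };

/* True if any program header carries the (empty) PT_S390_PGSTE marker. */
static bool wants_4k_page_tables(const struct phdr *phdrs, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (phdrs[i].p_type == PT_S390_PGSTE)
			return true;
	}
	return false;
}
```
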
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  17. 20 Apr, 2017 1 commit
  18. 18 Mar, 2017 1 commit
  19. 02 Mar, 2017 1 commit
  20. 23 Feb, 2017 1 commit
  21. 25 Dec, 2016 1 commit
  22. 24 Aug, 2016 1 commit
  23. 20 Jun, 2016 1 commit
  24. 13 Jun, 2016 1 commit
    • s390/mm: simplify the TLB flushing code · 64f31d58
      Martin Schwidefsky committed
      ptep_flush_lazy and pmdp_flush_lazy use mm->context.attach_count to
      decide between a lazy TLB flush vs an immediate TLB flush. The field
      contains two 16-bit counters, the number of CPUs that have the mm
      attached and can create TLB entries for it and the number of CPUs in
      the middle of a page table update.
      
      The __tlb_flush_asce, ptep_flush_direct and pmdp_flush_direct functions
      use the attach counter and a mask check with mm_cpumask(mm) to decide
      between a local flush of the current CPU and a global flush.
      
      For all these functions the decision between lazy vs immediate and
      local vs global TLB flush can be based on CPU masks. There are two
      masks:  the mm->context.cpu_attach_mask with the CPUs that are actively
      using the mm, and the mm_cpumask(mm) with the CPUs that have used the
      mm since the last full flush. The decision between lazy vs immediate
      flush is based on the mm->context.cpu_attach_mask, to decide between
      local vs global flush the mm_cpumask(mm) is used.
      
      With this patch all checks will use the CPU masks, the old counter
      mm->context.attach_count with its two 16-bit values is turned into a
      single counter mm->context.flush_count that keeps track of the number
      of CPUs with incomplete page table updates. The sole user of this
      counter is finish_arch_post_lock_switch() which waits for the end of
      all page table updates.
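
      The two mask-based decisions can be illustrated with plain bitmasks.
      This sketch assumes that "lazy" and "local" are both allowed exactly
      when the relevant mask contains only the current CPU; that rule, the
      64-bit mask type and all names below are assumptions for illustration,
      not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

struct mm_masks {
	uint64_t cpu_attach_mask;	/* CPUs actively using the mm */
	uint64_t mm_cpumask;		/* CPUs that used the mm since the last full flush */
};

/* Mask with only the bit of the given CPU set (cpu must be below 64). */
static uint64_t cpumask_of(unsigned int cpu)
{
	return UINT64_C(1) << cpu;
}

/* Lazy vs immediate: no other CPU is actively using the mm. */
static bool flush_can_be_lazy(const struct mm_masks *mm, unsigned int this_cpu)
{
	return mm->cpu_attach_mask == cpumask_of(this_cpu);
}

/* Local vs global: no other CPU can still hold TLB entries for the mm. */
static bool flush_can_be_local(const struct mm_masks *mm, unsigned int this_cpu)
{
	return mm->mm_cpumask == cpumask_of(this_cpu);
}
```
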
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  25. 21 Apr, 2016 1 commit
    • s390/mm: fix asce_bits handling with dynamic pagetable levels · 723cacbd
      Gerald Schaefer committed
      There is a race with multi-threaded applications between context switch and
      pagetable upgrade. In switch_mm() a new user_asce is built from mm->pgd and
      mm->context.asce_bits, w/o holding any locks. A concurrent mmap with a
      pagetable upgrade on another thread in crst_table_upgrade() could already
      have set new asce_bits, but not yet the new mm->pgd. This would result in a
      corrupt user_asce in switch_mm(), and eventually in a kernel panic from a
      translation exception.
      
      Fix this by storing the complete asce instead of just the asce_bits, which
      can then be read atomically from switch_mm(), so that it either sees the
      old value or the new value, but no mixture. Both cases are OK. Having the
      old value would result in a page fault on access to the higher level memory,
      but the fault handler would see the new mm->pgd, if it was a valid access
      after the mmap on the other thread has completed. So as worst-case scenario
      we would have a page fault loop for the racing thread until the next time
      slice.
      
      Also remove dead code and simplify the upgrade/downgrade path, there are no
      upgrades from 2 levels, and only downgrades from 3 levels for compat tasks.
      There are also no concurrent upgrades, because the mmap_sem is held with
      down_write() in do_mmap, so the flush and table checks during upgrade can
      be removed.
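
      The idea of publishing the table origin and the asce bits as one word
      can be shown with C11 atomics. This is a userspace analogue under
      stated assumptions, not the kernel change: publish_asce() and
      read_asce() are hypothetical helpers, and the kernel relies on an
      aligned word-sized store rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdint.h>

struct mm_context {
	_Atomic uint64_t asce;	/* table origin and asce bits in one word */
};

/* Writer (table upgrade): combine both parts, then store them at once. */
static void publish_asce(struct mm_context *ctx, uint64_t table_origin,
			 uint64_t asce_bits)
{
	atomic_store(&ctx->asce, table_origin | asce_bits);
}

/* Reader (context switch): sees the old or the new word, never a mixture. */
static uint64_t read_asce(struct mm_context *ctx)
{
	return atomic_load(&ctx->asce);
}
```
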
      Reported-by: Michael Munday <munday@ca.ibm.com>
      Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  26. 10 Mar, 2016 1 commit
    • s390/mm: four page table levels vs. fork · 3446c13b
      Martin Schwidefsky committed
      The fork of a process with four page table levels is broken since
      git commit 6252d702 "[S390] dynamic page tables."
      
      All new mm contexts are created with three page table levels and
      an asce limit of 4TB. If the parent has four levels dup_mmap will
      add vmas to the new context which are outside of the asce limit.
      The subsequent call to copy_page_range will walk the three level
      page table structure of the new process with non-zero pgd and pud
      indexes. This leads to memory clobbers as the pgd_index *and* the
      pud_index is added to the mm->pgd pointer without a pgd_deref
      in between.
      
      The init_new_context() function is selecting the number of page
      table levels for a new context. The function is used by mm_init()
      which in turn is called by dup_mm() and mm_alloc(). These two are
      used by fork() and exec(). The init_new_context() function can
      distinguish the two cases by looking at mm->context.asce_limit:
      for fork() the mm struct has been copied and the number of page
      table levels may not change. For exec() the mm_alloc() function
      sets the new mm structure to zero; in this case a three-level page
      table is created as the temporary stack space is located at
      STACK_TOP_MAX = 4TB.
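
      The fork() versus exec() distinction can be sketched as below. This is
      a simplified userspace model: init_context_levels() is a hypothetical
      helper, and the 2 GB and 8 PB limits for two and four levels are
      assumptions; only the zeroed limit for exec() and the 4TB value are
      stated above.

```c
/* Assumed address-space limits per number of page table levels. */
#define ASCE_LIMIT_2GB (1ULL << 31)	/* two levels (compat) */
#define ASCE_LIMIT_4TB (1ULL << 42)	/* three levels, STACK_TOP_MAX */
#define ASCE_LIMIT_8PB (1ULL << 53)	/* four levels */

struct mm_context { unsigned long long asce_limit; };

/* Returns the number of page table levels the new context must use. */
static int init_context_levels(struct mm_context *ctx)
{
	/* exec(): mm_alloc() zeroed the context, start with three levels. */
	if (ctx->asce_limit == 0)
		ctx->asce_limit = ASCE_LIMIT_4TB;

	/* fork(): the limit was copied from the parent and is kept as-is. */
	if (ctx->asce_limit <= ASCE_LIMIT_2GB)
		return 2;
	if (ctx->asce_limit <= ASCE_LIMIT_4TB)
		return 3;
	return 4;
}
```
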
      
      This fixes CVE-2016-2143.
      Reported-by: Marcin Kościelnicki <koriakin@0x04.net>
      Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  27. 19 Feb, 2016 2 commits
    • mm/core, x86/mm/pkeys: Differentiate instruction fetches · d61172b4
      Dave Hansen committed
      As discussed earlier, we attempt to enforce protection keys in
      software.
      
      However, the code checks all faults to ensure that they are not
      violating protection key permissions.  It was assumed that all
      faults are either write faults where we check PKRU[key].WD (write
      disable) or read faults where we check the AD (access disable)
      bit.
      
      But, there is a third category of faults for protection keys:
      instruction faults.  Instruction faults never run afoul of
      protection keys because they do not affect instruction fetches.
      
      So, plumb the PF_INSTR bit down in to the
      arch_vma_access_permitted() function where we do the protection
      key checks.
      
      We also add a new FAULT_FLAG_INSTRUCTION.  This is because
      handle_mm_fault() is not passed the architecture-specific
      error_code where we keep PF_INSTR, so we need to encode the
      instruction fetch information in to the arch-generic fault
      flags.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210224.96928009@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • mm/core: Do not enforce PKEY permissions on remote mm access · 1b2ee126
      Dave Hansen committed
      We try to enforce protection keys in software the same way that we
      do in hardware.  (See long example below).
      
      But, we only want to do this when accessing our *own* process's
      memory.  If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then
      tried to PTRACE_POKE a target process which just happened to have
      some mprotect_pkey(pkey=6) memory, we do *not* want to deny the
      debugger access to that memory.  PKRU is fundamentally a
      thread-local structure and we do not want to enforce it on access
      to _another_ thread's data.
      
      This gets especially tricky when we have workqueues or other
      delayed-work mechanisms that might run in a random process's context.
      We can check that we only enforce pkeys when operating on our *own* mm,
      but delayed work gets performed when a random user context is active.
      We might end up with a situation where a delayed-work gup fails when
      running randomly under its "own" task but succeeds when running under
      another process.  We want to avoid that.
      
      To avoid that, we use the new GUP flag: FOLL_REMOTE and add a
      fault flag: FAULT_FLAG_REMOTE.  They indicate that we are
      walking an mm which is not guaranteed to be the same as
      current->mm and should not be subject to protection key
      enforcement.
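
      The effect of the remote flag can be sketched as a short-circuit in
      front of the key check. This is an illustrative userspace model, not
      the kernel code: access_permitted() and pkru_access_disabled() are
      hypothetical names, and the PKRU value is passed in as a plain integer.

```c
#include <stdbool.h>
#include <stdint.h>

/* PKRU holds two bits per key; the lower one is AD (access disable). */
static bool pkru_access_disabled(uint32_t pkru, unsigned int pkey)
{
	return ((pkru >> (2 * pkey)) & 1u) != 0;
}

/*
 * Remote accesses (for example a debugger touching another mm) are not
 * subject to the accessing thread's PKRU; local accesses are checked.
 */
static bool access_permitted(uint32_t pkru, unsigned int pkey, bool remote)
{
	if (remote)
		return true;
	return !pkru_access_disabled(pkru, pkey);
}
```

      With PKRU[6].AD=1 in the debugger, a local access to pkey 6 memory is
      denied while the same access flagged as remote is allowed.
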
      
      Thanks to Jerome Glisse for pointing out this scenario.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Dominik Vogt <vogt@linux.vnet.ibm.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Geliang Tang <geliangtang@163.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: iommu@lists.linux-foundation.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-s390@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1b2ee126
  28. 18 Feb, 2016 1 commit
    • D
      mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys · 33a709b2
      Dave Hansen committed
      Today, for normal faults and page table walks, we check the VMA
      and/or PTE to ensure that it is compatible with the action.  For
      instance, if we get a write fault on a non-writeable VMA, we
      SIGSEGV.
      
      We try to do the same thing for protection keys.  Basically, we
      try to make sure that if a user does this:
      
      	mprotect(ptr, size, PROT_NONE);
      	*ptr = foo;
      
      they see the same effects with protection keys when they do this:
      
      	mprotect(ptr, size, PROT_READ|PROT_WRITE);
      	set_pkey(ptr, size, 4);
      	wrpkru(0xffffff3f); // access disable pkey 4
      	*ptr = foo;
      
      The state to do that checking is in the VMA, but we also
      sometimes have to do it on the page tables only, like when doing
      a get_user_pages_fast() where we have no VMA.
      
      We add two functions and expose them to generic code:
      
      	arch_pte_access_permitted(pte_flags, write)
      	arch_vma_access_permitted(vma, write)
      
      These are, of course, backed up in x86 arch code with checks
      against the PTE or VMA's protection key.
      
      But, there are also cases where we do not want to respect
      protection keys.  When we ptrace(), for instance, we do not want
      to apply the tracer's PKRU permissions to the PTEs from the
      process being traced.
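
      The PKRU check described above can be sketched in plain C. This is
      only an illustration, not the kernel's implementation: the helper
      name pkru_allows_access() is hypothetical. It relies on the x86
      PKRU layout, where each protection key owns two bits, access-disable
      at bit 2*pkey and write-disable at bit 2*pkey + 1. The value
      0xffffff3f from the example sets both bits for pkey 4.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-key bits inside PKRU: access-disable (AD) and write-disable (WD). */
#define PKRU_AD_BIT 0x1u
#define PKRU_WD_BIT 0x2u

/*
 * Hypothetical helper (not a kernel function): decide whether a data
 * access tagged with protection key `pkey` (0..15) is allowed under
 * the given PKRU value.
 */
static bool pkru_allows_access(uint32_t pkru, unsigned int pkey, bool write)
{
	/* Extract the two bits that belong to this key. */
	uint32_t bits = (pkru >> (pkey * 2)) & 0x3u;

	if (bits & PKRU_AD_BIT)
		return false;	/* every data access is disabled */
	if (write && (bits & PKRU_WD_BIT))
		return false;	/* reads are fine, writes are not */
	return true;
}
```

      A check of this kind needs only the key and the register value, so
      it can run against a PTE alone when no VMA is at hand.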
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Dominik Vogt <vogt@linux.vnet.ibm.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-s390@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20160212210219.14D5D715@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      33a709b2
  29. 23 Apr, 2015 1 commit
  30. 25 Mar, 2015 1 commit
    • H
      s390: remove 31 bit support · 5a79859a
      Heiko Carstens committed
      Remove the 31 bit support in order to reduce maintenance cost and
      effectively remove dead code. For a couple of years now, no
      distribution has shipped with a 31 bit kernel.
      
      The 31 bit kernel was also broken for more than a year before
      anybody noticed. In addition I added a removal warning to the kernel
      shown at ipl for 5 minutes: a960062e ("s390: add 31 bit warning
      message") which let everybody know about the plan to remove 31 bit
      code. We didn't get any response.
      
      Given that the last 31 bit only machine was introduced in 1999 let's
      remove the code.
      Anybody with 31 bit user space code can still use the compat mode.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      5a79859a
  31. 03 Mar, 2015 1 commit
  32. 19 Nov, 2014 1 commit
    • D
      mm: Make arch_unmap()/bprm_mm_init() available to all architectures · 62e88b1c
      Dave Hansen committed
      The x86 MPX patch set calls arch_unmap() and arch_bprm_mm_init()
      from fs/exec.c, so we need at least a stub for them in all
      architectures.  They are only called under an #ifdef for
      CONFIG_MMU=y, so we can at least restrict this to architectures
      with MMU support.
      
      blackfin/c6x have no MMU support, so do not call arch_unmap().
      They also do not include mm_hooks.h or mmu_context.h at all and
      do not need to be touched.
      
      s390, um and unicore32 do not use asm-generic/mm_hooks.h, so got
      their own arch_unmap() versions.  (I also moved um's
      arch_dup_mmap() to be closer to the other mm_hooks.h functions).
      
      xtensa only includes mm_hooks when MMU=y, which should be fine
      since arch_unmap() is called only from MMU=y code.
      
      For the rest, we use the stub copies of these functions in
      asm-generic/mm_hooks.h.
      
      I cross compiled defconfigs for cris (to check NOMMU) and s390
      to make sure that this works.  I also checked a 64-bit build
      of UML and all my normal x86 builds including PARAVIRT on and
      off.
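
      For illustration, the stubs mentioned above could look roughly like
      the following. This is a sketch under assumptions: the parameter
      lists are a guess at the hook signatures, not copied from the
      patch, and the two struct types are only forward-declared so the
      fragment compiles on its own.

```c
/* Forward declarations so the sketch is self-contained. */
struct mm_struct;
struct vm_area_struct;

/*
 * Illustrative no-op stubs for architectures that need no special
 * handling. The signatures are assumed, not taken from the patch.
 */
static inline void arch_unmap(struct mm_struct *mm,
			      struct vm_area_struct *vma,
			      unsigned long start, unsigned long end)
{
	/* Nothing to do on this architecture. */
}

static inline void arch_bprm_mm_init(struct mm_struct *mm,
				     struct vm_area_struct *vma)
{
	/* Nothing to do on this architecture. */
}
```

      Empty inline functions like these let generic callers invoke the
      hooks unconditionally, and the compiler drops the calls entirely.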
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: linux-arch@vger.kernel.org
      Cc: x86@kernel.org
      Link: http://lkml.kernel.org/r/20141118182350.8B4AA2C2@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      62e88b1c
  33. 10 Jun, 2014 1 commit
    • M
      s390/uaccess: always load the kernel ASCE after task switch · f8b13505
      Martin Schwidefsky committed
      This patch fixes a problem introduced with git commit beef560b
      "s390/uaccess: simplify control register updates".
      
      The switch_mm function is not called if the next process is a kernel
      thread without an attached mm, and it is a no-op if the mm does not change.
      But CR1 still needs to be loaded with the kernel ASCE in case the
      code returns to a uaccess function that uses the secondary space mode.
      
      In addition move the set_fs call from finish_arch_switch to
      finish_arch_post_lock_switch and then remove finish_arch_switch.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      f8b13505
  34. 20 May, 2014 2 commits
    • M
      s390: split TIF bits into CIF, PIF and TIF bits · d3a73acb
      Martin Schwidefsky committed
      The oi and ni instructions used in entry[64].S to set and clear bits
      in the thread-flags are not guaranteed to be atomic in regard to other
      CPUs. Split the TIF bits into CPU, pt_regs and thread-info specific
      bits. Updates on the TIF bits are done with atomic instructions,
      updates on CPU and pt_regs bits are done with non-atomic instructions.
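
      The distinction between the two kinds of update can be sketched in
      portable C11. This is an analogy only: the struct and function
      names are hypothetical, and the kernel uses its own bit operations
      rather than <stdatomic.h>.

```c
#include <stdatomic.h>

/* Hypothetical flag words; names and layout are illustrative only. */
struct demo_thread_info {
	atomic_ulong flags;		/* TIF-like bits: other CPUs may set them */
};

struct demo_cpu_state {
	unsigned long cpu_flags;	/* CIF-like bits: only the owning CPU writes */
};

/* Atomic read-modify-write: safe against concurrent setters elsewhere. */
static inline void demo_set_tif(struct demo_thread_info *ti, unsigned int bit)
{
	atomic_fetch_or(&ti->flags, 1UL << bit);
}

static inline void demo_clear_tif(struct demo_thread_info *ti, unsigned int bit)
{
	atomic_fetch_and(&ti->flags, ~(1UL << bit));
}

/* Plain read-modify-write: cheaper, valid only for CPU-local state. */
static inline void demo_set_cif(struct demo_cpu_state *cs, unsigned int bit)
{
	cs->cpu_flags |= 1UL << bit;
}
```

      Keeping CPU-local bits in a separate word means the cheap plain
      update can never lose a bit that another CPU set concurrently.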
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      d3a73acb
    • M
      s390/uaccess: simplify control register updates · beef560b
      Martin Schwidefsky committed
      Always switch to the kernel ASCE in switch_mm. Load the secondary
      space ASCE in finish_arch_post_lock_switch after checking that
      any pending page table operations have completed. The primary
      ASCE is loaded in entry[64].S. With this the update_primary_asce
      call can be removed from the switch_to macro and from the start
      of switch_mm function. Remove the load_primary argument from
      update_user_asce/clear_user_asce, rename update_user_asce to
      set_user_asce and rename update_primary_asce to load_kernel_asce.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      beef560b
  35. 22 Apr, 2014 2 commits