1. 16 1月, 2016 1 次提交
    • K
      s390, thp: remove infrastructure for handling splitting PMDs · fecffad2
      Kirill A. Shutemov 提交于
      With new refcounting we don't need to mark PMDs splitting.  Let's drop
      code to handle this.
      
      pmdp_splitting_flush() is not needed too: on splitting PMD we will do
      pmdp_clear_flush() + set_pte_at().  pmdp_clear_flush() will do IPI as
      needed for fast_gup.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fecffad2
  2. 15 1月, 2016 1 次提交
  3. 16 12月, 2015 1 次提交
  4. 19 8月, 2015 1 次提交
    • M
      s390/mm: simplify page table alloc/free code · 78fb9076
      Martin Schwidefsky 提交于
      With the removal of the dynamic reallocation of page tables for
      KVM (see git commit 0b46e0a3)
      the page table allocation / freeing code can be simplified.
      
      The page table free code can now use the alloc_pgste bit in the
      mm context to decide if a page table is 2K or 4K, there is no mix
      of different sized page tables anymore. This eliminates the need
      to use "page->_mapcount == 0" to check for 4K page table.
      
      Use the lower two bits in page->_mapcount to indicate which
      2K fragments of the 4K page are in use.
      
      As 31-bit support is gone, remove the two defines ALLOC_ORDER
      and FRAG_MASK and use the constants directly where appropriate.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      78fb9076
  5. 18 7月, 2015 2 次提交
  6. 26 6月, 2015 2 次提交
  7. 23 4月, 2015 1 次提交
  8. 25 3月, 2015 1 次提交
    • H
      s390: remove 31 bit support · 5a79859a
      Heiko Carstens 提交于
      Remove the 31 bit support in order to reduce maintenance cost and
      effectively remove dead code. Since a couple of years there is no
      distribution left that comes with a 31 bit kernel.
      
      The 31 bit kernel also has been broken since more than a year before
      anybody noticed. In addition I added a removal warning to the kernel
      shown at ipl for 5 minutes: a960062e ("s390: add 31 bit warning
      message") which let everybody know about the plan to remove 31 bit
      code. We didn't get any response.
      
      Given that the last 31 bit only machine was introduced in 1999 let's
      remove the code.
      Anybody with 31 bit user space code can still use the compat mode.
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      5a79859a
  9. 08 1月, 2015 2 次提交
  10. 28 11月, 2014 1 次提交
  11. 03 11月, 2014 1 次提交
  12. 28 10月, 2014 1 次提交
  13. 27 10月, 2014 4 次提交
  14. 15 10月, 2014 1 次提交
  15. 29 8月, 2014 1 次提交
  16. 26 8月, 2014 2 次提交
  17. 25 8月, 2014 3 次提交
  18. 01 8月, 2014 2 次提交
  19. 20 5月, 2014 1 次提交
    • M
      s390/uaccess: simplify control register updates · beef560b
      Martin Schwidefsky 提交于
      Always switch to the kernel ASCE in switch_mm. Load the secondary
      space ASCE in finish_arch_post_lock_switch after checking that
      any pending page table operations have completed. The primary
      ASCE is loaded in entry[64].S. With this the update_primary_asce
      call can be removed from the switch_to macro and from the start
      of switch_mm function. Remove the load_primary argument from
      update_user_asce/clear_user_asce, rename update_user_asce to
      set_user_asce and rename update_primary_asce to load_kernel_asce.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      beef560b
  20. 16 5月, 2014 1 次提交
  21. 22 4月, 2014 4 次提交
  22. 08 4月, 2014 1 次提交
    • A
      mm: revert "thp: make MADV_HUGEPAGE check for mm->def_flags" · 1e1836e8
      Alex Thorlton 提交于
      The main motivation behind this patch is to provide a way to disable THP
      for jobs where the code cannot be modified, and using a malloc hook with
      madvise is not an option (i.e.  statically allocated data).  This patch
      allows us to do just that, without affecting other jobs running on the
      system.
      
      We need to do this sort of thing for jobs where THP hurts performance,
      due to the possibility of increased remote memory accesses that can be
      created by situations such as the following:
      
      When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will
      be handed out, and the THP will be stuck on whatever node the chunk was
      originally referenced from.  If many remote nodes need to do work on
      that same chunk, they'll be making remote accesses.
      
      With THP disabled, 4K pages can be handed out to separate nodes as
      they're needed, greatly reducing the amount of remote accesses to
      memory.
      
      This patch is based on some of my work combined with some
      suggestions/patches given by Oleg Nesterov.  The main goal here is to
      add a prctl switch to allow us to disable to THP on a per mm_struct
      basis.
      
      Here's a bit of test data with the new patch in place...
      
      First with the flag unset:
      
        # perf stat -a ./prctl_wrapper_mmv3 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
        Setting thp_disabled for this task...
        thp_disable: 0
        Set thp_disabled state to 0
        Process pid = 18027
      
                                                                                                                             PF/
                                        MAX        MIN                                  TOTCPU/      TOT_PF/   TOT_PF/     WSEC/
        TYPE:               CPUS       WALL       WALL        SYS     USER     TOTCPU       CPU     WALL_SEC   SYS_SEC       CPU   NODES
         512      1.120      0.060      0.000    0.110      0.110     0.000    28571428864 -9223372036854775808  55803572      23
      
         Performance counter stats for './prctl_wrapper_mmv3_hack 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':
      
          273719072.841402 task-clock                #  641.026 CPUs utilized           [100.00%]
                 1,008,986 context-switches          #    0.000 M/sec                   [100.00%]
                     7,717 CPU-migrations            #    0.000 M/sec                   [100.00%]
                 1,698,932 page-faults               #    0.000 M/sec
        355,222,544,890,379 cycles                   #    1.298 GHz                     [100.00%]
        536,445,412,234,588 stalled-cycles-frontend  #  151.02% frontend cycles idle    [100.00%]
        409,110,531,310,223 stalled-cycles-backend   #  115.17% backend  cycles idle    [100.00%]
        148,286,797,266,411 instructions             #    0.42  insns per cycle
                                                     #    3.62  stalled cycles per insn [100.00%]
        27,061,793,159,503 branches                  #   98.867 M/sec                   [100.00%]
             1,188,655,196 branch-misses             #    0.00% of all branches
      
             427.001706337 seconds time elapsed
      
      Now with the flag set:
      
        # perf stat -a ./prctl_wrapper_mmv3 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
        Setting thp_disabled for this task...
        thp_disable: 1
        Set thp_disabled state to 1
        Process pid = 144957
      
                                                                                                                             PF/
                                        MAX        MIN                                  TOTCPU/      TOT_PF/   TOT_PF/     WSEC/
        TYPE:               CPUS       WALL       WALL        SYS     USER     TOTCPU       CPU     WALL_SEC   SYS_SEC       CPU   NODES
         512      0.620      0.260      0.250    0.320      0.570     0.001    51612901376 128000000000 100806448      23
      
         Performance counter stats for './prctl_wrapper_mmv3_hack 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':
      
          138789390.540183 task-clock                #  641.959 CPUs utilized           [100.00%]
                   534,205 context-switches          #    0.000 M/sec                   [100.00%]
                     4,595 CPU-migrations            #    0.000 M/sec                   [100.00%]
                63,133,119 page-faults               #    0.000 M/sec
        147,977,747,269,768 cycles                   #    1.066 GHz                     [100.00%]
        200,524,196,493,108 stalled-cycles-frontend  #  135.51% frontend cycles idle    [100.00%]
        105,175,163,716,388 stalled-cycles-backend   #   71.07% backend  cycles idle    [100.00%]
        180,916,213,503,160 instructions             #    1.22  insns per cycle
                                                     #    1.11  stalled cycles per insn [100.00%]
        26,999,511,005,868 branches                  #  194.536 M/sec                   [100.00%]
               714,066,351 branch-misses             #    0.00% of all branches
      
             216.196778807 seconds time elapsed
      
      As with previous versions of the patch, We're getting about a 2x
      performance increase here.  Here's a link to the test case I used, along
      with the little wrapper to activate the flag:
      
        http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz
      
      This patch (of 3):
      
      Revert commit 8e72033f and add in code to fix up any issues caused
      by the revert.
      
      The revert is necessary because hugepage_madvise would return -EINVAL
      when VM_NOHUGEPAGE is set, which will break subsequent chunks of this
      patch set.
      
      Here's a snip of an e-mail from Gerald detailing the original purpose of
      this code, and providing justification for the revert:
      
        "The intent of commit 8e72033f was to guard against any future
         programming errors that may result in an madvice(MADV_HUGEPAGE) on
         guest mappings, which would crash the kernel.
      
         Martin suggested adding the bit to arch/s390/mm/pgtable.c, if
         8e72033f was to be reverted, because that check will also prevent
         a kernel crash in the case described above, it will now send a
         SIGSEGV instead.
      
         This would now also allow to do the madvise on other parts, if
         needed, so it is a more flexible approach.  One could also say that
         it would have been better to do it this way right from the
         beginning..."
      Signed-off-by: NAlex Thorlton <athorlton@sgi.com>
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Tested-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1e1836e8
  23. 03 4月, 2014 3 次提交
    • H
      s390/uaccess: rework uaccess code - fix locking issues · 457f2180
      Heiko Carstens 提交于
      The current uaccess code uses a page table walk in some circumstances,
      e.g. in case of the in atomic futex operations or if running on old
      hardware which doesn't support the mvcos instruction.
      
      However it turned out that the page table walk code does not correctly
      lock page tables when accessing page table entries.
      In other words: a different cpu may invalidate a page table entry while
      the current cpu inspects the pte. This may lead to random data corruption.
      
      Adding correct locking however isn't trivial for all uaccess operations.
      Especially copy_in_user() is problematic since that requires to hold at
      least two locks, but must be protected against ABBA deadlock when a
      different cpu also performs a copy_in_user() operation.
      
      So the solution is a different approach where we change address spaces:
      
      User space runs in primary address mode, or access register mode within
      vdso code, like it currently already does.
      
      The kernel usually also runs in home space mode, however when accessing
      user space the kernel switches to primary or secondary address mode if
      the mvcos instruction is not available or if a compare-and-swap (futex)
      instruction on a user space address is performed.
      KVM however is special, since that requires the kernel to run in home
      address space while implicitly accessing user space with the sie
      instruction.
      
      So we end up with:
      
      User space:
      - runs in primary or access register mode
      - cr1 contains the user asce
      - cr7 contains the user asce
      - cr13 contains the kernel asce
      
      Kernel space:
      - runs in home space mode
      - cr1 contains the user or kernel asce
        -> the kernel asce is loaded when a uaccess requires primary or
           secondary address mode
      - cr7 contains the user or kernel asce, (changed with set_fs())
      - cr13 contains the kernel asce
      
      In case of uaccess the kernel changes to:
      - primary space mode in case of a uaccess (copy_to_user) and uses
        e.g. the mvcp instruction to access user space. However the kernel
        will stay in home space mode if the mvcos instruction is available
      - secondary space mode in case of futex atomic operations, so that the
        instructions come from primary address space and data from secondary
        space
      
      In case of kvm the kernel runs in home space mode, but cr1 gets switched
      to contain the gmap asce before the sie instruction gets executed. When
      the sie instruction is finished cr1 will be switched back to contain the
      user asce.
      
      A context switch between two processes will always load the kernel asce
      for the next process in cr1. So the first exit to user space is a bit
      more expensive (one extra load control register instruction) than before,
      however keeps the code rather simple.
      
      In sum this means there is no need to perform any error prone page table
      walks anymore when accessing user space.
      
      The patch seems to be rather large, however it mainly removes the
      the page table walk code and restores the previously deleted "standard"
      uaccess code, with a couple of changes.
      
      The uaccess without mvcos mode can be enforced with the "uaccess_primary"
      kernel parameter.
      Reported-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      457f2180
    • M
      s390/mm,tlb: optimize TLB flushing for zEC12 · 1b948d6c
      Martin Schwidefsky 提交于
      The zEC12 machines introduced the local-clearing control for the IDTE
      and IPTE instruction. If the control is set only the TLB of the local
      CPU is cleared of entries, either all entries of a single address space
      for IDTE, or the entry for a single page-table entry for IPTE.
      Without the local-clearing control the TLB flush is broadcasted to all
      CPUs in the configuration, which is expensive.
      
      The reset of the bit mask of the CPUs that need flushing after a
      non-local IDTE is tricky. As TLB entries for an address space remain
      in the TLB even if the address space is detached a new bit field is
      required to keep track of attached CPUs vs. CPUs in the need of a
      flush. After a non-local flush with IDTE the bit-field of attached CPUs
      is copied to the bit-field of CPUs in need of a flush. The ordering
      of operations on cpu_attach_mask, attach_count and mm_cpumask(mm) is
      such that an underindication in mm_cpumask(mm) is prevented but an
      overindication in mm_cpumask(mm) is possible.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      1b948d6c
    • M
      s390/mm,tlb: safeguard against speculative TLB creation · 02a8f3ab
      Martin Schwidefsky 提交于
      The principles of operations states that the CPU is allowed to create
      TLB entries for an address space anytime while an ASCE is loaded to
      the control register. This is true even if the CPU is running in the
      kernel and the user address space is not (actively) accessed.
      
      In theory this can affect two aspects of the TLB flush logic.
      For full-mm flushes the ASCE of the dying process is still attached.
      The approach to flush first with IDTE and then just free all page
      tables can in theory lead to stale TLB entries. Use the batched
      free of page tables for the full-mm flushes as well.
      
      For operations that can have a stale ASCE in the control register,
      e.g. a delayed update_user_asce in switch_mm, load the kernel ASCE
      to prevent invalid TLBs from being created.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      02a8f3ab
  24. 21 3月, 2014 2 次提交