1. 29 May 2011 (1 commit)
    • [S390] mm: fix storage key handling · a43a9d93
      Authored by Heiko Carstens
      page_get_storage_key() and page_set_storage_key() expect a page address
      and not its page frame number. This became inconsistent with commit 2d42552d
      "[S390] merge page_test_dirty and page_clear_dirty".
      
      The result is that we read/write storage keys of random pages and do not
      have working dirty bit tracking at all.
      E.g. SetPageUptodate() doesn't clear the dirty bit of requested pages, which
      for example ext4 doesn't like very much and panics after a while.
      
      Unable to handle kernel paging request at virtual user address (null)
      Oops: 0004 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      Modules linked in:
      CPU: 1 Not tainted 2.6.39-07551-g139f37f5-dirty #152
      Process flush-94:0 (pid: 1576, task: 000000003eb34538, ksp: 000000003c287b70)
      Krnl PSW : 0704c00180000000 0000000000316b12 (jbd2_journal_file_inode+0x10e/0x138)
                 R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 EA:3
      Krnl GPRS: 0000000000000000 0000000000000000 0000000000000000 0700000000000000
                 0000000000316a62 000000003eb34cd0 0000000000000025 000000003c287b88
                 0000000000000001 000000003c287a70 000000003f1ec678 000000003f1ec000
                 0000000000000000 000000003e66ec00 0000000000316a62 000000003c287988
      Krnl Code: 0000000000316b04: f0a0000407f4       srp     4(11,%r0),2036,0
                 0000000000316b0a: b9020022           ltgr    %r2,%r2
                 0000000000316b0e: a7740015           brc     7,316b38
                >0000000000316b12: e3d0c0000024       stg     %r13,0(%r12)
                 0000000000316b18: 4120c010           la      %r2,16(%r12)
                 0000000000316b1c: 4130d060           la      %r3,96(%r13)
                 0000000000316b20: e340d0600004       lg      %r4,96(%r13)
                 0000000000316b26: c0e50002b567       brasl   %r14,36d5f4
      Call Trace:
      ([<0000000000316a62>] jbd2_journal_file_inode+0x5e/0x138)
       [<00000000002da13c>] mpage_da_map_and_submit+0x2e8/0x42c
       [<00000000002daac2>] ext4_da_writepages+0x2da/0x504
       [<00000000002597e8>] writeback_single_inode+0xf8/0x268
       [<0000000000259f06>] writeback_sb_inodes+0xd2/0x18c
       [<000000000025a700>] writeback_inodes_wb+0x80/0x168
       [<000000000025aa92>] wb_writeback+0x2aa/0x324
       [<000000000025abde>] wb_do_writeback+0xd2/0x274
       [<000000000025ae3a>] bdi_writeback_thread+0xba/0x1c4
       [<00000000001737be>] kthread+0xa6/0xb0
       [<000000000056c1da>] kernel_thread_starter+0x6/0xc
       [<000000000056c1d4>] kernel_thread_starter+0x0/0xc
      INFO: lockdep is turned off.
      Last Breaking-Event-Address:
       [<0000000000316a8a>] jbd2_journal_file_inode+0x86/0x138
      Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      a43a9d93
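      For context, the distinction at the heart of this fix is page address
      versus page frame number. A minimal sketch of the broken and fixed call
      shapes (not the literal patch; helper signatures are simplified):

          unsigned char key;
          unsigned long pfn  = page_to_pfn(page);
          unsigned long addr = pfn << PAGE_SHIFT;   /* what the helpers expect */

          /* broken: a pfn where an address is expected, so the storage key of
           * some unrelated page is read/written                              */
          /* key = page_get_storage_key(pfn); */

          /* fixed: pass the page address */
          key = page_get_storage_key(addr);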
  2. 23 May 2011 (3 commits)
    • [S390] refactor page table functions for better pgste support · b2fa47e6
      Authored by Martin Schwidefsky
      Rework the architecture page table functions to access the bits in the
      page table extension array (pgste). There are a number of changes:
      1) Fix missing pgste update if the attach_count for the mm is <= 1.
      2) For every operation that affects the invalid bit in the pte or the
         rcp byte in the pgste, the pcl lock needs to be acquired (see the
         sketch after this commit entry). The function pgste_get_lock gets
         the pcl lock and returns the current pgste value for a pte pointer.
         The function pgste_set_unlock stores the pgste and releases the
         lock. Between these two calls the bits in the pgste can be shuffled.
      3) Define two software bits in the pte _PAGE_SWR and _PAGE_SWC to avoid
         calling SetPageDirty and SetPageReferenced from pgtable.h. If the
         host reference backup bit or the host change backup bit has been
         set, the dirty/referenced state is transferred to the pte. The common
         code will pick up the state from the pte.
      4) Add ptep_modify_prot_start and ptep_modify_prot_commit for mprotect.
      5) Remove pgd_populate_kernel, pud_populate_kernel, pmd_populate_kernel,
         pgd_clear_kernel, pud_clear_kernel, pmd_clear_kernel and ptep_invalidate.
      6) Rename kvm_s390_test_and_clear_page_dirty to
         ptep_test_and_clear_user_dirty and add ptep_test_and_clear_user_young.
      7) Define the mm_exclusive() and mm_has_pgste() helpers to improve readability.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      b2fa47e6
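      The locking convention from point 2 above, as a minimal sketch (assuming
      only the function names described in the commit message; the real
      implementation carries more state through the pgste value):

          pgste_t pgste;

          pgste = pgste_get_lock(ptep);    /* take the pcl lock, read the pgste   */
          /* ... shuffle bits: move referenced/changed state, update the pte ... */
          pgste_set_unlock(ptep, pgste);   /* write back the pgste, drop the lock */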
    • [S390] merge page_test_dirty and page_clear_dirty · 2d42552d
      Authored by Martin Schwidefsky
      The page_clear_dirty primitive always sets the default storage key
      which resets the access control bits and the fetch protection bit.
      That will surprise a KVM guest that sets non-zero access control
      bits or the fetch protection bit. Merge page_test_dirty and
      page_clear_dirty back to a single function and only clear the
      dirty bit from the storage key.
      
      In addition, move the functions page_test_and_clear_dirty and
      page_test_and_clear_young to page.h where they belong. This
      requires changing the parameter from a struct page * to a page
      frame number.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      2d42552d
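      A hedged sketch of the merged primitive described above (bit names and
      signatures are simplified; _PAGE_CHANGED stands for the change/dirty bit
      of the storage key):

          static inline int page_test_and_clear_dirty(unsigned long pfn)
          {
                  unsigned char skey;

                  skey = page_get_storage_key(pfn << PAGE_SHIFT);
                  if (!(skey & _PAGE_CHANGED))
                          return 0;
                  /* clear only the change bit; keep the access control bits and
                   * the fetch protection bit a KVM guest may have set          */
                  page_set_storage_key(pfn << PAGE_SHIFT, skey & ~_PAGE_CHANGED);
                  return 1;
          }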
    • [S390] Remove data execution protection · 043d0708
      Authored by Martin Schwidefsky
      The noexec support on s390 does not rely on a bit in the page table
      entry but utilizes the secondary space mode to distinguish between
      memory accesses for instructions vs. data. The noexec code relies
      on the assumption that the cpu will always use the secondary space
      page table for data accesses while it is running in the secondary
      space mode. Up to the z9-109 class machines this has been the case.
      Unfortunately this is not true anymore with z10 and later machines.
      The load-relative-long instructions lrl, lgrl and lgfrl access the
      memory operand using the same addressing-space mode that has been
      used to fetch the instruction.
      This breaks the noexec mode for all user space binaries compiled
      with -march=z10 or later. The only option is to remove the current
      noexec support.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      043d0708
  3. 27 October 2010 (1 commit)
  4. 25 October 2010 (5 commits)
  5. 24 August 2010 (1 commit)
    • [S390] fix tlb flushing vs. concurrent /proc accesses · 050eef36
      Authored by Martin Schwidefsky
      The tlb flushing code uses the mm_users field of the mm_struct to
      decide if each page table entry needs to be flushed individually with
      IPTE or if a global flush for the mm_struct is sufficient after all page
      table updates have been done. The comment for mm_users says "How many
      users with user space?", but the /proc code increases mm_users after it
      has found the process structure by pid, without creating a new user
      process. This makes mm_users useless for the decision between the two
      tlb flushing methods. The current code can be tricked into not flushing
      tlb entries by a concurrent access to /proc files, e.g. if a fork is in
      progress. The solution for this problem is to make the tlb flushing
      logic independent of the mm_users field.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      050eef36
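      The resulting decision can be sketched roughly as follows (hedged: the
      attach_count bookkeeping is the mm_users replacement described above,
      the helper name is invented for illustration, and the real check
      involves additional state):

          /* flush each pte individually with IPTE only if another cpu might
           * still have the mm attached; otherwise a single full flush of the
           * mm after the batch of updates is enough                          */
          static inline int flush_ptes_individually(struct mm_struct *mm)
          {
                  return atomic_read(&mm->context.attach_count) > 1;
          }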
  6. 09 April 2010 (1 commit)
  7. 21 February 2010 (1 commit)
    • MM: Pass a PTE pointer to update_mmu_cache() rather than the PTE itself · 4b3073e1
      Authored by Russell King
      On VIVT ARM, when we have multiple shared mappings of the same file
      in the same MM, we need to ensure that we have coherency across all
      copies.  We do this via make_coherent() by making the pages
      uncacheable.
      
      This used to work fine, until we allowed highmem with highpte - we
      now have a page table which is mapped as required, and is not available
      for modification via update_mmu_cache().
      
      Ralf Baechle suggested getting rid of the PTE value passed to
      update_mmu_cache():
      
        On MIPS update_mmu_cache() calls __update_tlb() which walks pagetables
        to construct a pointer to the pte again.  Passing a pte_t * is much
        more elegant.  Maybe we might even replace the pte argument with the
        pte_t?
      
      Ben Herrenschmidt would also like the pte pointer for PowerPC:
      
        Passing the ptep in there is exactly what I want.  I want that
        -instead- of the PTE value, because I have issue on some ppc cases,
        for I$/D$ coherency, where set_pte_at() may decide to mask out the
        _PAGE_EXEC.
      
      So, pass in the mapped page table pointer into update_mmu_cache(), and
      remove the PTE value, updating all implementations and call sites to
      suit.
      
      Includes a fix from Stephen Rothwell:
      
        sparc: fix fallout from update_mmu_cache API change
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      4b3073e1
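      Concretely, the interface changes from passing a pte value to passing the
      pte pointer the caller already holds (sketch of the prototypes only):

          /* before */
          void update_mmu_cache(struct vm_area_struct *vma,
                                unsigned long address, pte_t pte);

          /* after: the architecture reads *ptep itself, so MIPS does not have
           * to re-walk the page tables and PowerPC sees the pte as written by
           * set_pte_at() rather than a pre-masked value                       */
          void update_mmu_cache(struct vm_area_struct *vma,
                                unsigned long address, pte_t *ptep);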
  8. 07 December 2009 (1 commit)
  9. 12 June 2009 (1 commit)
  10. 27 November 2008 (1 commit)
    • [S390] pgtable.h: Fix oops in unmap_vmas for KVM processes · 2944a5c9
      Authored by Christian Borntraeger
      When running several kvm processes with lots of memory overcommitment,
      we have seen an oops during process shutdown:
      ------------[ cut here ]------------
      Kernel BUG at 0000000000193434 [verbose debug info unavailable]
      addressing exception: 0005 [#1] PREEMPT SMP
      Modules linked in: kvm sunrpc qeth_l2 dm_mod qeth ccwgroup
      CPU: 10 Not tainted 2.6.28-rc4-kvm-bigiron-00521-g0ccca08-dirty #8
      Process kuli (pid: 14460, task: 0000000149822338, ksp: 0000000024f57650)
      Krnl PSW : 0704e00180000000 0000000000193434 (unmap_vmas+0x884/0xf10)
      R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 EA:3
      Krnl GPRS: 0000000000000002 0000000000000000 000000051008d000 000003e05e6034e0
                 00000000001933f6 00000000000001e9 0000000407259e0a 00000002be88c400
                 00000200001c1000 0000000407259608 0000000407259e08 0000000024f577f0
                 0000000407259e09 0000000000445fa8 00000000001933f6 0000000024f577f0
      Krnl Code: 0000000000193426: eb22000c000d sllg %r2,%r2,12
                 000000000019342c: a7180000 lhi %r1,0
                 0000000000193430: b2290012 iske %r1,%r2
                >0000000000193434: a7110002 tmll %r1,2
                 0000000000193438: a7840006 brc 8,193444
                 000000000019343c: 9602c000 oi 0(%r12),2
                 0000000000193440: 96806000 oi 0(%r6),128
                 0000000000193444: a7110004 tmll %r1,4
      Call Trace:
      ([<00000000001933f6>] unmap_vmas+0x846/0xf10)
      [<0000000000199680>] exit_mmap+0x210/0x458
      [<000000000012a8f8>] mmput+0x54/0xfc
      [<000000000012f714>] exit_mm+0x134/0x144
      [<000000000013120c>] do_exit+0x240/0x878
      [<00000000001318dc>] do_group_exit+0x98/0xc8
      [<000000000013e6b0>] get_signal_to_deliver+0x30c/0x358
      [<000000000010bee0>] do_signal+0xec/0x860
      [<0000000000112e30>] sysc_sigpending+0xe/0x22
      [<000002000013198a>] 0x2000013198a
      INFO: lockdep is turned off.
      Last Breaking-Event-Address:
      [<00000000001a68d0>] free_swap_and_cache+0x1a0/0x1a4
      <4>---[ end trace bc19f1d51ac9db7c ]---
      
      The faulting instruction is the storage key operation (iske) in
      ptep_rcp_copy (called by pte_clear, called by unmap_vmas). iske
      reads dirty and reference bit information for a physical page and
      requires a valid physical address. Since we are in pte_clear, we
      cannot rely on the pte containing a valid address. Fortunately we
      don't need this information in pte_clear - after all, there is no
      mapping. The best fix is to remove the needless call to ptep_rcp_copy
      that contains the iske.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      2944a5c9
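      The shape of the fix, as a hedged sketch (macro and function names only
      approximate the s390 code of that time):

          static inline void pte_clear(struct mm_struct *mm,
                                       unsigned long addr, pte_t *ptep)
          {
                  /* no ptep_rcp_copy() here anymore: it executed iske on the
                   * physical address taken from the pte, which need not be
                   * valid at this point, and the referenced/changed state is
                   * not needed for an entry that is being cleared anyway     */
                  pte_val(*ptep) = _PAGE_TYPE_EMPTY;
          }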
  11. 28 October 2008 (1 commit)
    • [S390] pgtables: Fix race in enable_sie vs. page table ops · 250cf776
      Authored by Christian Borntraeger
      The current enable_sie code sets the mm->context.pgstes bit to tell
      dup_mm that the new mm should have extended page tables. This bit is also
      used by the s390 specific page table primitives to decide about the page
      table layout - which means context.pgstes has two meanings. This can cause
      all kinds of bugs. For example, shrink_zone can call
      ptep_clear_flush_young while enable_sie is running. ptep_clear_flush_young
      will test for context.pgstes. Since enable_sie changed that value of the old
      struct mm without changing the page table layout, ptep_clear_flush_young will
      do the wrong thing.
      The solution is to split pgstes into two bits
      - one for the allocation
      - one for the current state
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      250cf776
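      A hedged sketch of the two bits described above (the field names are
      assumptions for illustration; the real mm_context_t carries more fields):

          typedef struct {
                  unsigned int alloc_pgste:1;  /* new page tables for this mm and
                                                  for children created by dup_mm
                                                  get the extended pgste format  */
                  unsigned int has_pgste:1;    /* the page tables attached right
                                                  now actually have that format,
                                                  which is what the pte
                                                  primitives must test           */
                  /* ... remaining s390 mm context fields ... */
          } mm_context_t;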
  12. 11 October 2008 (1 commit)
  13. 02 August 2008 (1 commit)
  14. 14 July 2008 (1 commit)
  15. 08 July 2008 (1 commit)
  16. 30 April 2008 (2 commits)
  17. 28 April 2008 (2 commits)
    • s390: implement pte special bit · a08cb629
      Authored by Nick Piggin
      Convert XIP to support non-struct page backed memory, using VM_MIXEDMAP for
      the user mappings.
      
      This requires the get_xip_page API to be changed to an address based one.
      Improve the API layering a little bit too, while we're here.
      
      This is required in order to support XIP filesystems on memory that isn't
      backed with struct page (but memory with struct page is still supported too).
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Carsten Otte <cotte@de.ibm.com>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a08cb629
    • mm: introduce pte_special pte bit · 7e675137
      Authored by Nick Piggin
      s390 for one, cannot implement VM_MIXEDMAP with pfn_valid, due to their memory
      model (which is more dynamic than most).  Instead, they had proposed to
      implement it with an additional path through vm_normal_page(), using a bit in
      the pte to determine whether or not the page should be refcounted:
      
      vm_normal_page()
      {
      	...
              if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
                      if (vma->vm_flags & VM_MIXEDMAP) {
      #ifdef s390
      			if (!mixedmap_refcount_pte(pte))
      				return NULL;
      #else
                              if (!pfn_valid(pfn))
                                      return NULL;
      #endif
                              goto out;
                      }
      	...
      }
      
      This is fine; however, if we are allowed to use a bit in the pte to determine
      refcountedness, we can use that to _completely_ replace all the vma based
      schemes.  So instead of adding more cases to the already complex vma-based
      scheme, we can have a clearly separate and simple pte-based scheme (and get
      slightly better code generation in the process):
      
      vm_normal_page()
      {
      #ifdef s390
      	if (!mixedmap_refcount_pte(pte))
      		return NULL;
      	return pte_page(pte);
      #else
      	...
      #endif
      }
      
      And finally, we may rather make this concept usable by any architecture rather
      than making it s390 only, so implement a new type of pte state for this.
      Unfortunately the old vma based code must stay, because some architectures may
      not be able to spare pte bits.  This makes vm_normal_page a little bit more
      ugly than we would like, but the two cases are clearly separate.
      
      So introduce a pte_special pte state, and use it in mm/memory.c.  It is
      currently a noop for all architectures, so this doesn't actually result in any
      compiled code changes to mm/memory.o.
      
      BTW:
      I haven't put vm_normal_page() into arch code as-per an earlier suggestion.
      The reason is that, regardless of where vm_normal_page is actually
      implemented, the *abstraction* is still exactly the same. Also, while it
      depends on whether the architecture has pte_special or not, those are the
      only two possible cases, and it really isn't an arch specific function --
      the role of the arch code should be to provide primitive functions and
      accessors with which to build the core code; pte_special does that. We do
      not want architectures to know or care about vm_normal_page itself, and
      we definitely don't want them being able to invent something new there
      out of sight of mm/ code. If we made vm_normal_page an arch function, then
      we have to make vm_insert_mixed (next patch) an arch function too. So I
      don't think moving it to arch code fundamentally improves any abstractions,
      while it does practically make the code more difficult to follow, for both
      mm and arch developers, and easier to misuse.
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Carsten Otte <cotte@de.ibm.com>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e675137
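      For architectures that do provide the bit, the resulting core-code shape
      is roughly (hedged and heavily simplified from mm/memory.c):

          struct page *vm_normal_page(struct vm_area_struct *vma,
                                      unsigned long addr, pte_t pte)
          {
                  if (HAVE_PTE_SPECIAL) {
                          if (likely(!pte_special(pte)))
                                  return pte_page(pte);  /* normal, refcounted page */
                          return NULL;                   /* special: no struct page */
                  }
                  /* otherwise fall back to the VM_PFNMAP/VM_MIXEDMAP vma checks */
                  ...
          }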
  18. 27 April 2008 (3 commits)
  19. 10 February 2008 (4 commits)
  20. 05 February 2008 (1 commit)
    • [S390] Remove BUILD_BUG_ON() in vmem code. · 0189103c
      Authored by Heiko Carstens
      Remove BUILD_BUG_ON() in vmem code since it causes build failures if
      the size of struct page increases. Instead calculate at compile time
      the address of the highest physical address that can be added to the
      1:1 mapping.
      This is supposed to fix a build failure with the page owner tracking leak
      detector patches as reported by akpm.
      
      page-owner-tracking-leak-detector-broken-on-s390.patch can be removed
      from -mm again when this is merged.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      0189103c
  21. 26 January 2008 (1 commit)
    • [S390] Change vmalloc definitions · 5fd9c6e2
      Authored by Christian Borntraeger
      Currently the vmalloc area starts at a dynamic address depending on
      the memory size. There was also an 8MB guard hole after the
      physical memory to catch out-of-bounds accesses.
      We can simplify the code by putting the vmalloc area explicitly at
      the top of the kernel mapping and setting the vmalloc size to a fixed
      value of 128MB/128GB for 31bit/64bit systems. Part of the vmalloc
      area will be used for the vmem_map. This leaves an area of 96MB/1GB
      for normal vmalloc allocations.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      5fd9c6e2
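      A hedged sketch of the fixed layout for the 31 bit case (the constants
      are illustrative values derived from the description above; the actual
      macro values live in pgtable.h and may differ):

          #define VMALLOC_SIZE   (128UL << 20)                 /* fixed 128 MB        */
          #define VMALLOC_END    (0x80000000UL)                /* top of the 31 bit
                                                                  kernel mapping      */
          #define VMALLOC_START  (VMALLOC_END - VMALLOC_SIZE)
          /* part of this area holds the vmem_map, leaving roughly 96 MB for
           * ordinary vmalloc allocations                                      */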
  22. 17 December 2007 (1 commit)
    • [S390] pud_present/pmd_present bug. · 0d017923
      Authored by Martin Schwidefsky
      Git commit 3610cce8 (yeah my own :-/)
      introduced a bug in regard to pud/pmd table entries.
      If the address of the page table referred to by a pud/pmd value happens
      to have zeroes in the lower 32 bits, pud_present and pmd_present return
      false. The obvious effect is that this triggers the BUG_ON in exit_mmap
      because some ptes will not get released on process end.  Worse is that
      the next fault for memory covered by that pud/pmd will allocate another
      pmd/pte table and populate the pud/pmd entry. The old page table
      entries hanging below this entry are lost!
      
      The fix is simple: properly check against 0. The check is added for
      pud_none/pmd_none as well, even though these two functions work because
      the invalid bit is in the lower 32 bits.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      0d017923
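      The fixed check, as a hedged sketch (the pmd variant is analogous, using
      the segment table origin mask):

          static inline int pud_present(pud_t pud)
          {
                  /* compare the full 64 bit value against 0; returning the
                   * masked origin directly gets truncated to the int return
                   * value, so a table address with zeroes in the low 32 bits
                   * would wrongly read as "not present"                      */
                  return (pud_val(pud) & _REGION_ENTRY_ORIGIN) != 0UL;
          }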
  23. 22 October 2007 (3 commits)
    • [S390] 4level-fixup cleanup · 190a1d72
      Authored by Martin Schwidefsky
      Get independent from asm-generic/4level-fixup.h
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      190a1d72
    • [S390] Cleanup page table definitions. · 3610cce8
      Authored by Martin Schwidefsky
      - De-confuse the defines for the address-space-control-elements
        and the segment/region table entries.
      - Create out of line functions for page table allocation / freeing.
      - Simplify get_shadow_xxx functions.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      3610cce8
    • [S390] tlb flush fix. · ba8a9229
      Authored by Martin Schwidefsky
      The current tlb flushing code for page table entries violates the
      s390 architecture in a small detail. The relevant section from the
      principles of operation (SA22-7832-02 page 3-47):
      
         "A valid table entry must not be changed while it is attached
         to any CPU and may be used for translation by that CPU except to
         (1) invalidate the entry by using INVALIDATE PAGE TABLE ENTRY or
         INVALIDATE DAT TABLE ENTRY, (2) alter bits 56-63 of a page-table
         entry, or (3) make a change by means of a COMPARE AND SWAP AND
         PURGE instruction that purges the TLB."
      
      That means if one thread of a multithreaded application uses a vma
      while another thread does an unmap on it, the page table entries of
      that vma need to be removed with IPTE, IDTE or CSP. In some strange
      and rare situations a cpu could check-stop (die) because an entry has
      been pushed out of the TLB that is still needed to complete a
      (milli-coded) instruction. I've never seen it happen with the current
      code on any of the supported machines, so right now this is a
      theoretical problem. But I want to fix it nevertheless, to avoid
      headaches in the future.
      
      To get this implemented correctly without changing common code the
      primitives ptep_get_and_clear, ptep_get_and_clear_full and
      ptep_set_wrprotect need to use the IPTE instruction to invalidate the
      pte before the new pte value gets stored. If IPTE is always used for
      the three primitives, three important operations will take a performance
      hit: fork, mprotect and exit_mmap. Time for some workarounds:
      
      * 1: ptep_get_and_clear_full is used in unmap_vmas to remove page
      table entries in a batched tlb gather operation. If the mmu_gather
      context passed to unmap_vmas has been started with full_mm_flush==1
      or if only one cpu is online or if the only user of a mm_struct is the
      current process then the fullmm indication in the mmu_gather context is
      set to one. All TLBs for mm_struct are flushed by the tlb_gather_mmu
      call. No new TLBs can be created while the unmap is in progress. In
      this case ptep_get_and_clear_full clears the ptes with a simple store.
      
      * 2: ptep_get_and_clear is used in change_protection to clear the
      ptes from the page tables before they are reentered with the new
      access flags. At the end of the update flush_tlb_range clears the
      remaining TLBs. In general the ptep_get_and_clear has to issue IPTE
      for each pte and flush_tlb_range is a nop. But if there is only one
      user of the mm_struct then ptep_get_and_clear uses simple stores
      to do the update and flush_tlb_range will flush the TLBs.
      
      * 3: Similar to 2, ptep_set_wrprotect is used in copy_page_range
      for a fork to make all ptes of a cow mapping read-only. At the end of
      of copy_page_range dup_mmap will flush the TLBs with a call to
      flush_tlb_mm.  Check for mm->mm_users and if there is only one user
      avoid using IPTE in ptep_set_wrprotect and let flush_tlb_mm clear the
      TLBs.
      
      Overall, for single threaded programs the tlb flush code now performs
      better; for multithreaded programs it is slightly worse. In particular
      exit_mmap() now does a single IDTE for the mm and then just frees every
      page cache reference and every page table page directly without a delay
      over the mmu_gather structure.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      ba8a9229
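      The common shape of workarounds 2 and 3 above, as a hedged sketch (the
      code of that time was macro-based and the exact condition differs
      slightly):

          static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
                                                 unsigned long addr, pte_t *ptep)
          {
                  pte_t pte = *ptep;

                  if (atomic_read(&mm->mm_users) > 1 || mm != current->active_mm)
                          ptep_invalidate(addr, ptep);  /* IPTE: flush this entry */
                  else
                          pte_clear(mm, addr, ptep);    /* single user: plain store;
                                                           flush_tlb_range/_mm does
                                                           the flushing later      */
                  return pte;
          }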
  24. 12 October 2007 (1 commit)
  25. 18 July 2007 (1 commit)