1. 10 August 2016, 1 commit
    • s390/pageattr: handle numpages parameter correctly · 4d81aaa5
      Committed by Heiko Carstens
      Both set_memory_ro() and set_memory_rw() will modify the page
      attributes of at least one page, even if the numpages parameter is
      zero.
      
      The author expected that calling these functions with numpages == 0
      would never happen. However, with the new 444d13ff ("modules: add
      ro_after_init support") feature, this happens frequently.
      
      Therefore do the right thing and make these two functions return
      gracefully if there is nothing to do.
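
      A minimal sketch of that guard, assuming the early return sits in
      set_memory_ro()/set_memory_rw() themselves (illustrative only, not the
      exact upstream hunk):

      /* Sketch: bail out early when numpages is zero, instead of always
       * touching at least one page of the kernel mapping.
       */
      int set_memory_ro(unsigned long addr, int numpages)
      {
              if (!numpages)
                      return 0;
              /* ... mark numpages pages of the kernel mapping read-only ... */
              return 0;
      }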
      
      Fixes crashes on module load like this one:
      
      Unable to handle kernel pointer dereference in virtual kernel address space
      Failing address: 000003ff80008000 TEID: 000003ff80008407
      Fault in home space mode while using kernel ASCE.
      AS:0000000000d18007 R3:00000001e6aa4007 S:00000001e6a10800 P:00000001e34ee21d
      Oops: 0004 ilc:3 [#1] SMP
      Modules linked in: x_tables
      CPU: 10 PID: 1 Comm: systemd Not tainted 4.7.0-11895-g3fa9045 #4
      Hardware name: IBM              2964 N96              703              (LPAR)
      task: 00000001e9118000 task.stack: 00000001e9120000
      Krnl PSW : 0704e00180000000 00000000005677f8 (rb_erase+0xf0/0x4d0)
                 R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
      Krnl GPRS: 000003ff80008b20 000003ff80008b20 000003ff80008b70 0000000000b9d608
                 000003ff80008b20 0000000000000000 00000001e9123e88 000003ff80008950
                 00000001e485ab40 000003ff00000000 000003ff80008b00 00000001e4858480
                 0000000100000000 000003ff80008b68 00000000001d5998 00000001e9123c28
      Krnl Code: 00000000005677e8: ec1801c3007c        cgij    %r1,0,8,567b6e
                 00000000005677ee: e32010100020        cg      %r2,16(%r1)
                #00000000005677f4: a78401c2            brc     8,567b78
                >00000000005677f8: e35010080024        stg     %r5,8(%r1)
                 00000000005677fe: ec5801af007c        cgij    %r5,0,8,567b5c
                 0000000000567804: e30050000024        stg     %r0,0(%r5)
                 000000000056780a: ebacf0680004        lmg     %r10,%r12,104(%r15)
                 0000000000567810: 07fe                bcr     15,%r14
      Call Trace:
      ([<000003ff80008900>] __this_module+0x0/0xffffffffffffd700 [x_tables])
      ([<0000000000264fd4>] do_init_module+0x12c/0x220)
      ([<00000000001da14a>] load_module+0x24e2/0x2b10)
      ([<00000000001da976>] SyS_finit_module+0xbe/0xd8)
      ([<0000000000803b26>] system_call+0xd6/0x264)
      Last Breaking-Event-Address:
       [<000000000056771a>] rb_erase+0x12/0x4d0
       Kernel panic - not syncing: Fatal exception: panic_on_oops
      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Reported-and-tested-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Fixes: e8a97e42 ("s390/pageattr: allow kernel page table splitting")
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  2. 31 July 2016, 1 commit
    • s390/mm: clean up pte/pmd encoding · bc29b7ac
      Committed by Gerald Schaefer
      The hugetlbfs pte<->pmd conversion functions currently assume that the pmd
      bit layout is consistent with the pte layout, which is not really true.
      
      The SW read and write bits are encoded as the sequence "wr" in a pte, but
      in a pmd it is "rw". The hugetlbfs conversion assumes that the sequence
      is identical in both cases, which results in swapped read and write bits
      in the pmd. In practice this is not a problem, because those pmd bits are
      only relevant for THP pmds and not for hugetlbfs pmds. The hugetlbfs code
      works on (fake) ptes, and the converted pte bits are correct.
      
      There is another variation in pte/pmd encoding which affects dirty
      prot-none ptes/pmds. In this case, a pmd has both its HW read-only and
      invalid bit set, while it is only the invalid bit for a pte. This also has
      no effect in practice, but it is better to be consistent.
      
      This patch fixes both inconsistencies by changing the SW read/write bit
      layout for pmds as well as the PAGE_NONE encoding for ptes. It also makes
      the hugetlbfs conversion functions more robust by introducing a
      move_set_bit() macro that uses the pte/pmd bit #defines instead of
      constant shifts.
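
      A plausible shape for the move_set_bit() helper mentioned above
      (illustrative; the upstream definition may differ; flag names as in
      arch/s390/include/asm/pgtable.h):

      /*
       * Sketch: move a single software bit from its position in one entry
       * format to its position in another, using the bit #defines instead
       * of hard-coded shift counts.  Only valid for single-bit masks.
       */
      #define move_set_bit(val, bit, new_bit) \
              ((((val) & (bit)) / (bit)) * (new_bit))

      /* e.g. carry the pte software write bit over into the pmd entry */
      pmd_val(pmd) |= move_set_bit(pte_val(pte), _PAGE_WRITE,
                                   _SEGMENT_ENTRY_WRITE);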
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  3. 27 July 2016, 1 commit
  4. 13 July 2016, 1 commit
  5. 06 July 2016, 1 commit
  6. 28 June 2016, 1 commit
  7. 25 June 2016, 1 commit
  8. 20 June 2016, 18 commits
  9. 14 June 2016, 1 commit
    • s390/mm: fix compile for PAGE_DEFAULT_KEY != 0 · de3fa841
      Committed by Heiko Carstens
      The usual problem for code that is ifdef'ed out is that it no longer
      compiles after a while. That's also the case for the storage key
      initialisation code, if it were used (i.e. PAGE_DEFAULT_KEY set to a
      non-zero value):
      
      ./arch/s390/include/asm/page.h: In function 'storage_key_init_range':
      ./arch/s390/include/asm/page.h:36:2: error: implicit declaration of function '__storage_key_init_range'
      
      Since the code itself has been useful for debugging purposes several
      times, remove the ifdefs and make sure the code gets compiler
      coverage. The cost for this is eight bytes.
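
      A sketch of what removing the ifdefs can look like, so the compiler
      always sees the call and simply optimizes it away while
      PAGE_DEFAULT_KEY is zero (illustrative, not the exact upstream hunk):

      /* arch/s390/include/asm/page.h (sketch) */
      static inline void storage_key_init_range(unsigned long start,
                                                unsigned long end)
      {
              /* always compiled; a constant-zero PAGE_DEFAULT_KEY lets the
               * compiler drop the call, but the code keeps getting compile
               * coverage
               */
              if (PAGE_DEFAULT_KEY)
                      __storage_key_init_range(start, end);
      }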
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  10. 13 June 2016, 14 commits
    • s390: avoid extable collisions · 6c22c986
      Committed by Heiko Carstens
      We have some inline assemblies where the extable entry points to a
      label at the end of an inline assembly which is not followed by an
      instruction.
      
      On the other hand we have also inline assemblies where the extable
      entry points to the first instruction of an inline assembly.
      
      If a first-type inline asm (extable points to an empty label at the end)
      is directly followed by a second-type inline asm (extable points to its
      first instruction), then we would have two different extable entries
      that point to the same instruction but have different target
      addresses.
      
      This can lead to quite random behaviour, depending on sorting order.
      
      I verified that we currently do not have such collisions within the
      kernel. However to avoid such subtle bugs add a couple of nop
      instructions to those inline assemblies which contain an extable that
      points to an empty label.
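
      A schematic example of such a pattern and the added nop (the function
      name and variables are illustrative; EX_TABLE and nopr are existing
      s390 primitives):

      static inline int diag308_sketch(unsigned long subcode, void *addr)
      {
              register unsigned long _addr asm("0") = (unsigned long) addr;
              register unsigned long _rc asm("1") = 0;

              asm volatile(
                      "       diag    %0,%2,0x308\n"
                      "0:     nopr    %%r7\n" /* added: label 0 gets its own insn */
                      EX_TABLE(0b, 0b)
                      : "+d" (_addr), "+d" (_rc)
                      : "d" (subcode)
                      : "cc", "memory");
              return _rc;
      }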
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390: add proper __ro_after_init support · d07a980c
      Committed by Heiko Carstens
      On s390 __ro_after_init is currently mapped to __read_mostly which
      means that data marked as __ro_after_init will not be protected.
      
      The reason for this is that the common code __ro_after_init implementation
      is x86 centric: the ro_after_init data section was added to rodata,
      since x86 enables write protection to kernel text and rodata very
      late. On s390 we have write protection for these sections enabled with
      the initial page tables. So adding the ro_after_init data section to
      rodata does not work on s390.
      
      In order to make __ro_after_init work properly on s390, move the
      ro_after_init data right behind rodata. Unlike the rodata section, it
      will be marked read-only later after all init calls happened.
      
      This s390 specific implementation adds new __start_ro_after_init and
      __end_ro_after_init labels. Everything in between will be marked
      read-only after the init calls have happened. In addition to the
      __ro_after_init data, also move the exception table there, since from a
      practical point of view it fits the __ro_after_init requirements.
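
      A sketch of the late protection step described above (the helper name
      is hypothetical; set_memory_ro() and the section labels are the ones
      the message introduces):

      extern char __start_ro_after_init[], __end_ro_after_init[];

      /* hypothetical helper, run once all init calls have finished */
      static void mark_ro_after_init_ro(void)
      {
              unsigned long start = (unsigned long) __start_ro_after_init;
              unsigned long end = (unsigned long) __end_ro_after_init;

              set_memory_ro(start, (end - start) >> PAGE_SHIFT);
      }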
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/mm: simplify the TLB flushing code · 64f31d58
      Committed by Martin Schwidefsky
      ptep_flush_lazy and pmdp_flush_lazy use mm->context.attach_count to
      decide between a lazy TLB flush vs an immediate TLB flush. The field
      contains two 16-bit counters, the number of CPUs that have the mm
      attached and can create TLB entries for it and the number of CPUs in
      the middle of a page table update.
      
      The __tlb_flush_asce, ptep_flush_direct and pmdp_flush_direct functions
      use the attach counter and a mask check with mm_cpumask(mm) to decide
      between a local flush of the current CPU and a global flush.
      
      For all these functions the decision between lazy vs immediate and
      local vs global TLB flush can be based on CPU masks. There are two
      masks:  the mm->context.cpu_attach_mask with the CPUs that are actively
      using the mm, and the mm_cpumask(mm) with the CPUs that have used the
      mm since the last full flush. The decision between lazy vs immediate
      flush is based on the mm->context.cpu_attach_mask, to decide between
      local vs global flush the mm_cpumask(mm) is used.
      
      With this patch all checks will use the CPU masks, the old counter
      mm->context.attach_count with its two 16-bit values is turned into a
      single counter mm->context.flush_count that keeps track of the number
      of CPUs with incomplete page table updates. The sole user of this
      counter is finish_arch_post_lock_switch() which waits for the end of
      all page table updates.
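
      A sketch of the resulting decisions (field and helper names follow the
      commit message; the function bodies are schematic, not the upstream
      code):

      /* lazy vs. immediate: defer only if no other CPU has the mm attached */
      static inline void ptep_flush_lazy_sketch(struct mm_struct *mm,
                                                unsigned long addr, pte_t *ptep)
      {
              if (cpumask_equal(&mm->context.cpu_attach_mask,
                                cpumask_of(smp_processor_id())))
                      pte_val(*ptep) |= _PAGE_INVALID;        /* flush later */
              else
                      ptep_flush_direct(mm, addr, ptep);      /* flush now   */
      }

      /* local vs. global: flush only this CPU if no other CPU used the mm */
      static inline void tlb_flush_mm_sketch(struct mm_struct *mm)
      {
              if (cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id())))
                      __tlb_flush_local();
              else
                      __tlb_flush_global();
      }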
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/mm: fix vunmap vs finish_arch_post_lock_switch · a9809407
      Committed by Martin Schwidefsky
      The vunmap_pte_range() function calls ptep_get_and_clear() without any
      locking. ptep_get_and_clear() uses ptep_xchg_lazy()/ptep_flush_direct()
      for the page table update. ptep_flush_direct requires that preemption
      is disabled, but without any locking this is not the case. If the kernel
      preempts the task while the attach_counter is increased, an endless loop
      in finish_arch_post_lock_switch() will occur the next time the task is
      scheduled.
      
      Add explicit preempt_disable()/preempt_enable() calls to the relevant
      functions in arch/s390/mm/pgtable.c.
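
      A sketch of the shape of the fix in one of the affected functions (the
      body is schematic; the point is the preempt_disable()/preempt_enable()
      bracket):

      pte_t ptep_xchg_lazy(struct mm_struct *mm, unsigned long addr,
                           pte_t *ptep, pte_t new)
      {
              pte_t old;

              preempt_disable();      /* ptep_flush_direct() must not race
                                       * with finish_arch_post_lock_switch() */
              old = ptep_flush_lazy(mm, addr, ptep);
              *ptep = new;
              preempt_enable();
              return old;
      }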
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/mm: align swapper_pg_dir to 16k · 0ccb32c9
      Committed by Heiko Carstens
      The segment/region table that is part of the kernel image must be
      properly aligned to 16k in order to make the crdte inline assembly
      work.
      Otherwise the crdte instruction calculates a wrong segment/region table
      start address and accesses incorrect memory locations if swapper_pg_dir
      is not aligned to 16k.
      
      Therefore define BSS_FIRST_SECTIONS in order to put the swapper_pg_dir
      at the beginning of the bss section and also align the bss section to
      16k just like other architectures did.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/pgtable: add mapping statistics · 37cd944c
      Committed by Heiko Carstens
      Add statistics that show how memory is mapped within the kernel
      identity mapping. This is more or less the same as git
      commit ce0c0e50 ("x86, generic: CPA add statistics about state
      of direct mapping v4") for x86.
      
      I also intentionally copied the lower case "k" within DirectMap4k vs
      the upper case "M" and "G" within the two other lines. Let's have
      consistent inconsistencies across architectures.
      
      The output of /proc/meminfo now contains these additional lines:
      
      DirectMap4k:        2048 kB
      DirectMap1M:     3991552 kB
      DirectMap2G:     4194304 kB
      
      The implementation on s390 is lockless, unlike the x86 version, since I
      assume changes to the kernel mapping are a very rare event. Therefore
      it really doesn't matter if these statistics are potentially
      inconsistent when read while kernel page tables are being changed.
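
      A sketch of how such counters can be kept and reported via the generic
      arch_report_meminfo() hook (the counter names and granularity enum are
      illustrative):

      enum {
              PG_DIRECT_MAP_4K = 0,
              PG_DIRECT_MAP_1M,
              PG_DIRECT_MAP_2G,
              PG_DIRECT_MAP_MAX
      };

      static atomic_long_t direct_pages_count[PG_DIRECT_MAP_MAX];

      /* lockless by design: mapping changes are rare, slight skew is fine */
      void arch_report_meminfo(struct seq_file *m)
      {
              seq_printf(m, "DirectMap4k:    %8lu kB\n",
                         atomic_long_read(&direct_pages_count[PG_DIRECT_MAP_4K]) << 2);
              seq_printf(m, "DirectMap1M:    %8lu kB\n",
                         atomic_long_read(&direct_pages_count[PG_DIRECT_MAP_1M]) << 10);
              seq_printf(m, "DirectMap2G:    %8lu kB\n",
                         atomic_long_read(&direct_pages_count[PG_DIRECT_MAP_2G]) << 21);
      }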
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/vmem: simplify vmem code for read-only mappings · bab247ff
      Committed by Heiko Carstens
      For the kernel identity mapping, map everything read-writable and
      subsequently call set_memory_ro() to make the ro section read-only.
      This simplifies the code a lot.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/pageattr: allow kernel page table splitting · e8a97e42
      Committed by Heiko Carstens
      set_memory_ro() and set_memory_rw() currently only work on 4k
      mappings, which is good enough for module code aka the vmalloc area.
      
      However we stumbled already twice into the need to make this also work
      on larger mappings:
      - the ro after init patch set
      - the crash kernel resize code
      
      Therefore this patch implements automatic kernel page table splitting
      if e.g. set_memory_ro() is called on parts of a 2G mapping.
      This works much the same as the x86 code, but is much simpler.
      
      In order to make this work and to be architecturally compliant we now
      always use the csp, cspg or crdte instructions to replace valid page
      table entries. This means that set_memory_ro() and set_memory_rw()
      will be much more expensive than before. In order to avoid huge
      latencies the code contains a couple of cond_resched() calls.
      
      The current code only splits page tables, but does not merge them if
      it would be possible. The reason for this is that currently there is
      no real-life scenario where this would really happen. All current use
      cases that I know of only change access rights once during the
      lifetime. If that should change, we can still implement kernel page table
      merging at a later time.
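
      A schematic of the splitting walk described above (all helper names are
      illustrative; the real code replaces live entries with csp/cspg/crdte):

      /* Schematic: split large mappings on demand, yield between steps. */
      static int change_attr_sketch(unsigned long addr, unsigned long end,
                                    bool make_ro)
      {
              while (addr < end) {
                      if (mapped_by_large_entry(addr))        /* illustrative */
                              split_large_entry(addr);        /* csp/cspg/crdte */
                      update_4k_range(addr, make_ro);         /* illustrative */
                      addr += PAGE_SIZE;
                      cond_resched();         /* keep latencies bounded */
              }
              return 0;
      }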
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/mm: always use PAGE_KERNEL when mapping pages · 3e76ee99
      Committed by Heiko Carstens
      Always use PAGE_KERNEL when re-enabling pages within the kernel
      mapping due to debug pagealloc. Without this pgprot value,
      pte_mkwrite() and pte_wrprotect() no longer work on such mappings after
      an unmap -> map cycle.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/vmem: make use of pte_clear() · 5aa29975
      Committed by Heiko Carstens
      Use pte_clear() instead of open-coding it.
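
      For illustration (addr and ptep are placeholders):

      pte_val(*ptep) = _PAGE_INVALID;         /* open-coded clear */
      pte_clear(&init_mm, addr, ptep);        /* helper doing the same */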
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/pgtable: get rid of _REGION3_ENTRY_RO · c126aa83
      Committed by Heiko Carstens
      _REGION3_ENTRY_RO is a duplicate of _REGION_ENTRY_PROTECT.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/vmem: introduce and use SEGMENT_KERNEL and REGION3_KERNEL · 2dffdcba
      Committed by Heiko Carstens
      Instead of open-coding the SEGMENT_KERNEL and REGION3_KERNEL values, use
      defines. Also, to make e.g. pmd_wrprotect() work on the kernel mapping,
      a couple more flags must be set. Therefore add the missing flags as well.
      
      In order to make everything symmetrical this patch also adds software
      dirty, young, read and write bits for region 3 table entries.
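
      An illustrative shape of such a define (the authoritative flag list is
      what the commit introduces; flag names as in
      arch/s390/include/asm/pgtable.h):

      #define SEGMENT_KERNEL  __pgprot(_SEGMENT_ENTRY |       \
                                       _SEGMENT_ENTRY_LARGE | \
                                       _SEGMENT_ENTRY_YOUNG | \
                                       _SEGMENT_ENTRY_DIRTY | \
                                       _SEGMENT_ENTRY_READ |  \
                                       _SEGMENT_ENTRY_WRITE)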
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/vmem: align segment and region tables to 16k · 2e9996fc
      Committed by Heiko Carstens
      Usually segment and region tables are 16k aligned due to the way the
      buddy allocator works. This is not true for the vmem code, which only
      asks for a 4k alignment. In order to be consistent, enforce a 16k
      alignment here as well.
      
      This alignment will be assumed and therefore is required by the
      pageattr code.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • KVM: s390/mm: Fix CMMA reset during reboot · 1c343f7b
      Committed by Christian Borntraeger
      Commit 1e133ab2 ("s390/mm: split arch/s390/mm/pgtable.c") factored
      out the page table handling code from __gmap_zap and __s390_reset_cmma
      into ptep_zap_unused and added a simple flag that tells which of the two
      behaviours (reset or not) is wanted. This also changed the behaviour,
      as it now also zaps unused page table entries on reset.
      It turns out that this is wrong, as s390_reset_cmma uses the page walker,
      which DOES NOT take the ptl lock.
      
      The simplest fix is to not do the zapping part on reset (which uses
      the walker).
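
      A sketch of the flag-based behaviour after the fix (the signature
      follows the 1e133ab2 split; the helpers in the body are illustrative):

      void ptep_zap_unused(struct mm_struct *mm, unsigned long addr,
                           pte_t *ptep, int reset)
      {
              if (reset) {
                      /* walker path, no ptl held: only reset the CMMA
                       * usage state, do not zap the entry
                       */
                      reset_cmma_state(ptep);                 /* illustrative */
              } else if (pte_is_unused_sketch(*ptep)) {       /* illustrative */
                      /* __gmap_zap path, ptl held: zapping is safe here */
                      pte_clear(mm, addr, ptep);
              }
      }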
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Fixes: 1e133ab2 ("s390/mm: split arch/s390/mm/pgtable.c")
      Cc: stable@vger.kernel.org # 4.6+
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>