1. 06 June 2018, 1 commit
    • powerpc/64s/radix: Fix missing ptesync in flush_cache_vmap · ff5bc793
      By Nicholas Piggin
      There is a typo in the config ifdef added by f1cb8f9b ("powerpc/64s/radix:
      avoid ptesync after set_pte and ptep_set_access_flags"), which results in
      the necessary ptesync not being issued after vmalloc.
      
      This causes random kernel faults in module load, bpf load, anywhere
      that vmalloc mappings are used.
      
      After correcting the code, this survives a guest kernel booting
      hundreds of times where previously there would be a crash every few
      boots (I haven't noticed the crash on host, perhaps due to different
      TLB and page table walking behaviour in hardware).
      
      A memory clobber is also added to the flush, just to be sure it won't
      be reordered with the pte set or the subsequent mapping access.
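
      A minimal sketch of the corrected hook, assuming the radix-only ifdef
      form (the exact config symbol and placement in
      arch/powerpc/include/asm/cacheflush.h may differ):

        #ifdef CONFIG_PPC_BOOK3S_64
        /* Order the vmalloc PTE stores before any access through the new mapping. */
        #define flush_cache_vmap(start, end)	asm volatile("ptesync" ::: "memory")
        #else
        #define flush_cache_vmap(start, end)	do { } while (0)
        #endif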
      
      Fixes: f1cb8f9b ("powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags")
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      ff5bc793
  2. 03 June 2018, 39 commits
    • powerpc/time: inline arch_vtime_task_switch() · 60f1d289
      By Christophe Leroy
      arch_vtime_task_switch() is a small function which is called
      only from vtime_common_task_switch(), so it is worth inlining.
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      60f1d289
    • powerpc: Implement csum_ipv6_magic in assembly · e9c4943a
      By Christophe Leroy
      The generic csum_ipv6_magic() generates a pretty bad result
      
      00000000 <csum_ipv6_magic>: (PPC32)
         0:	81 23 00 00 	lwz     r9,0(r3)
         4:	81 03 00 04 	lwz     r8,4(r3)
         8:	7c e7 4a 14 	add     r7,r7,r9
         c:	7d 29 38 10 	subfc   r9,r9,r7
        10:	7d 4a 51 10 	subfe   r10,r10,r10
        14:	7d 27 42 14 	add     r9,r7,r8
        18:	7d 2a 48 50 	subf    r9,r10,r9
        1c:	80 e3 00 08 	lwz     r7,8(r3)
        20:	7d 08 48 10 	subfc   r8,r8,r9
        24:	7d 4a 51 10 	subfe   r10,r10,r10
        28:	7d 29 3a 14 	add     r9,r9,r7
        2c:	81 03 00 0c 	lwz     r8,12(r3)
        30:	7d 2a 48 50 	subf    r9,r10,r9
        34:	7c e7 48 10 	subfc   r7,r7,r9
        38:	7d 4a 51 10 	subfe   r10,r10,r10
        3c:	7d 29 42 14 	add     r9,r9,r8
        40:	7d 2a 48 50 	subf    r9,r10,r9
        44:	80 e4 00 00 	lwz     r7,0(r4)
        48:	7d 08 48 10 	subfc   r8,r8,r9
        4c:	7d 4a 51 10 	subfe   r10,r10,r10
        50:	7d 29 3a 14 	add     r9,r9,r7
        54:	7d 2a 48 50 	subf    r9,r10,r9
        58:	81 04 00 04 	lwz     r8,4(r4)
        5c:	7c e7 48 10 	subfc   r7,r7,r9
        60:	7d 4a 51 10 	subfe   r10,r10,r10
        64:	7d 29 42 14 	add     r9,r9,r8
        68:	7d 2a 48 50 	subf    r9,r10,r9
        6c:	80 e4 00 08 	lwz     r7,8(r4)
        70:	7d 08 48 10 	subfc   r8,r8,r9
        74:	7d 4a 51 10 	subfe   r10,r10,r10
        78:	7d 29 3a 14 	add     r9,r9,r7
        7c:	7d 2a 48 50 	subf    r9,r10,r9
        80:	81 04 00 0c 	lwz     r8,12(r4)
        84:	7c e7 48 10 	subfc   r7,r7,r9
        88:	7d 4a 51 10 	subfe   r10,r10,r10
        8c:	7d 29 42 14 	add     r9,r9,r8
        90:	7d 2a 48 50 	subf    r9,r10,r9
        94:	7d 08 48 10 	subfc   r8,r8,r9
        98:	7d 4a 51 10 	subfe   r10,r10,r10
        9c:	7d 29 2a 14 	add     r9,r9,r5
        a0:	7d 2a 48 50 	subf    r9,r10,r9
        a4:	7c a5 48 10 	subfc   r5,r5,r9
        a8:	7c 63 19 10 	subfe   r3,r3,r3
        ac:	7d 29 32 14 	add     r9,r9,r6
        b0:	7d 23 48 50 	subf    r9,r3,r9
        b4:	7c c6 48 10 	subfc   r6,r6,r9
        b8:	7c 63 19 10 	subfe   r3,r3,r3
        bc:	7c 63 48 50 	subf    r3,r3,r9
        c0:	54 6a 80 3e 	rotlwi  r10,r3,16
        c4:	7c 63 52 14 	add     r3,r3,r10
        c8:	7c 63 18 f8 	not     r3,r3
        cc:	54 63 84 3e 	rlwinm  r3,r3,16,16,31
        d0:	4e 80 00 20 	blr
      
      0000000000000000 <.csum_ipv6_magic>: (PPC64)
         0:	81 23 00 00 	lwz     r9,0(r3)
         4:	80 03 00 04 	lwz     r0,4(r3)
         8:	81 63 00 08 	lwz     r11,8(r3)
         c:	7c e7 4a 14 	add     r7,r7,r9
        10:	7f 89 38 40 	cmplw   cr7,r9,r7
        14:	7d 47 02 14 	add     r10,r7,r0
        18:	7d 30 10 26 	mfocrf  r9,1
        1c:	55 29 f7 fe 	rlwinm  r9,r9,30,31,31
        20:	7d 4a 4a 14 	add     r10,r10,r9
        24:	7f 80 50 40 	cmplw   cr7,r0,r10
        28:	7d 2a 5a 14 	add     r9,r10,r11
        2c:	80 03 00 0c 	lwz     r0,12(r3)
        30:	81 44 00 00 	lwz     r10,0(r4)
        34:	7d 10 10 26 	mfocrf  r8,1
        38:	55 08 f7 fe 	rlwinm  r8,r8,30,31,31
        3c:	7d 29 42 14 	add     r9,r9,r8
        40:	81 04 00 04 	lwz     r8,4(r4)
        44:	7f 8b 48 40 	cmplw   cr7,r11,r9
        48:	7d 29 02 14 	add     r9,r9,r0
        4c:	7d 70 10 26 	mfocrf  r11,1
        50:	55 6b f7 fe 	rlwinm  r11,r11,30,31,31
        54:	7d 29 5a 14 	add     r9,r9,r11
        58:	7f 80 48 40 	cmplw   cr7,r0,r9
        5c:	7d 29 52 14 	add     r9,r9,r10
        60:	7c 10 10 26 	mfocrf  r0,1
        64:	54 00 f7 fe 	rlwinm  r0,r0,30,31,31
        68:	7d 69 02 14 	add     r11,r9,r0
        6c:	7f 8a 58 40 	cmplw   cr7,r10,r11
        70:	7c 0b 42 14 	add     r0,r11,r8
        74:	81 44 00 08 	lwz     r10,8(r4)
        78:	7c f0 10 26 	mfocrf  r7,1
        7c:	54 e7 f7 fe 	rlwinm  r7,r7,30,31,31
        80:	7c 00 3a 14 	add     r0,r0,r7
        84:	7f 88 00 40 	cmplw   cr7,r8,r0
        88:	7d 20 52 14 	add     r9,r0,r10
        8c:	80 04 00 0c 	lwz     r0,12(r4)
        90:	7d 70 10 26 	mfocrf  r11,1
        94:	55 6b f7 fe 	rlwinm  r11,r11,30,31,31
        98:	7d 29 5a 14 	add     r9,r9,r11
        9c:	7f 8a 48 40 	cmplw   cr7,r10,r9
        a0:	7d 29 02 14 	add     r9,r9,r0
        a4:	7d 70 10 26 	mfocrf  r11,1
        a8:	55 6b f7 fe 	rlwinm  r11,r11,30,31,31
        ac:	7d 29 5a 14 	add     r9,r9,r11
        b0:	7f 80 48 40 	cmplw   cr7,r0,r9
        b4:	7d 29 2a 14 	add     r9,r9,r5
        b8:	7c 10 10 26 	mfocrf  r0,1
        bc:	54 00 f7 fe 	rlwinm  r0,r0,30,31,31
        c0:	7d 29 02 14 	add     r9,r9,r0
        c4:	7f 85 48 40 	cmplw   cr7,r5,r9
        c8:	7c 09 32 14 	add     r0,r9,r6
        cc:	7d 50 10 26 	mfocrf  r10,1
        d0:	55 4a f7 fe 	rlwinm  r10,r10,30,31,31
        d4:	7c 00 52 14 	add     r0,r0,r10
        d8:	7f 80 30 40 	cmplw   cr7,r0,r6
        dc:	7d 30 10 26 	mfocrf  r9,1
        e0:	55 29 ef fe 	rlwinm  r9,r9,29,31,31
        e4:	7c 09 02 14 	add     r0,r9,r0
        e8:	54 03 80 3e 	rotlwi  r3,r0,16
        ec:	7c 03 02 14 	add     r0,r3,r0
        f0:	7c 03 00 f8 	not     r3,r0
        f4:	78 63 84 22 	rldicl  r3,r3,48,48
        f8:	4e 80 00 20 	blr
      
      This patch implements it in assembly for both PPC32 and PPC64.

      Link: https://github.com/linuxppc/linux/issues/9
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e9c4943a
    • powerpc/lib: Adjust .balign inside string functions for PPC32 · 1128bb78
      By Christophe Leroy
      commit 87a156fb ("Align hot loops of some string functions")
      degraded the performance of string functions by adding useless
      nops.

      A simple benchmark on an 8xx, calling memchr() 100000 times with a
      match on the first byte, runs in 41668 TB ticks before this patch
      and in 35986 TB ticks after it, an improvement of approximately 10%.

      Another benchmark doing the same with a memchr() matching the 128th
      byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
      after it, so regardless of the number of loops, removing those
      useless nops saves 5683 TB ticks on this test.
      
      Fixes: 87a156fb ("Align hot loops of some string functions")
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      1128bb78
    • powerpc/64: optimises from64to32() · 55a0edf0
      By Christophe Leroy
      The current implementation of from64to32() gives a poor result:
      
      0000000000000270 <.from64to32>:
       270:	38 00 ff ff 	li      r0,-1
       274:	78 69 00 22 	rldicl  r9,r3,32,32
       278:	78 00 00 20 	clrldi  r0,r0,32
       27c:	7c 60 00 38 	and     r0,r3,r0
       280:	7c 09 02 14 	add     r0,r9,r0
       284:	78 09 00 22 	rldicl  r9,r0,32,32
       288:	7c 00 4a 14 	add     r0,r0,r9
       28c:	78 03 00 20 	clrldi  r3,r0,32
       290:	4e 80 00 20 	blr
      
      This patch modifies from64to32() to operate in the same
      spirit as csum_fold().

      It swaps the two 32-bit halves of the sum, then adds the swapped
      value to the original. If there is a carry out of the lower 32-bit
      half, it propagates into the upper half, giving the correct folded
      sum in the upper half.
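
      In C terms, the trick described above amounts to the following sketch
      (ror64() is the generic kernel rotate helper; the exact form used in
      arch/powerpc may differ slightly):

        static inline __u32 from64to32(__u64 x)
        {
        	/* Add the swapped halves to the original; a carry out of the
        	 * low 32 bits lands in the high 32 bits, so the folded sum
        	 * ends up in the upper half. */
        	return (x + ror64(x, 32)) >> 32;
        }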
      
      The resulting code is:
      
      0000000000000260 <.from64to32>:
       260:	78 60 00 02 	rotldi  r0,r3,32
       264:	7c 60 1a 14 	add     r3,r0,r3
       268:	78 63 00 22 	rldicl  r3,r3,32,32
       26c:	4e 80 00 20 	blr
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      55a0edf0
    • powerpc/sstep: Introduce GETTYPE macro · e6684d07
      By Ravi Bangoria
      Replace the 'op->type & INSTR_TYPE_MASK' expression with a
      GETTYPE(op->type) macro.
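
      The assumed definition simply names the existing expression (a sketch;
      the real macro lives in arch/powerpc/include/asm/sstep.h):

        #define GETTYPE(t)	((t) & INSTR_TYPE_MASK)
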
      Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e6684d07
    • powerpc: Use barrier_nospec in copy_from_user() · ddf35cf3
      By Michael Ellerman
      Based on the x86 commit doing the same.
      
      See commit 304ec1b0 ("x86/uaccess: Use __uaccess_begin_nospec()
      and uaccess_try_nospec") and b3bbfb3f ("x86: Introduce
      __uaccess_begin_nospec() and uaccess_try_nospec") for more detail.
      
      In all cases we are ordering the load from the potentially
      user-controlled pointer vs a previous branch based on an access_ok()
      check or similar.
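
      A sketch of the pattern this enforces (helper names and signatures as
      of this era's uaccess code; details may differ):

        if (!access_ok(VERIFY_READ, from, n))	/* branch that can be speculated past */
        	return n;
        barrier_nospec();	/* no loads through the user pointer until the check resolves */
        return __copy_tofrom_user(to, from, n);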
      
      Based on a patch from Michal Suchanek.
      Signed-off-by: Michal Suchanek <msuchanek@suse.de>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      ddf35cf3
    • powerpc/64s: Enable barrier_nospec based on firmware settings · cb3d6759
      By Michal Suchanek
      Check what firmware told us and enable/disable the barrier_nospec as
      appropriate.
      
      We err on the side of enabling the barrier, as it's a no-op on older
      systems; see the comment for more detail.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      cb3d6759
    • powerpc/64s: Patch barrier_nospec in modules · 815069ca
      By Michal Suchanek
      Note that unlike RFI, which is patched only in the kernel, the nospec
      state reflects the settings at the time the module was loaded.
      
      Iterating all modules and re-patching every time the settings change
      is not implemented.
      
      Based on lwsync patching.
      Signed-off-by: Michal Suchanek <msuchanek@suse.de>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      815069ca
    • powerpc/64s: Add support for ori barrier_nospec patching · 2eea7f06
      By Michal Suchanek
      Based on the RFI patching. This is required to be able to disable the
      speculation barrier.
      
      Only one barrier type is supported and it does nothing when the
      firmware does not enable it. Re-patching modules is also not
      supported, so the only meaningful thing that can be done is patching
      out the speculation barrier at boot when the user says it is not
      wanted.
      Signed-off-by: Michal Suchanek <msuchanek@suse.de>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      2eea7f06
    • powerpc/64s: Add barrier_nospec · a6b3964a
      By Michal Suchanek
      A no-op form of ori (or immediate of 0 into r31 and the result stored
      in r31) has been re-tasked as a speculation barrier. The instruction
      only acts as a barrier on newer machines with appropriate firmware
      support. On older CPUs it remains a harmless no-op.
      
      Implement barrier_nospec using this instruction.
      
      mpe: The semantics of the instruction are believed to be that it
      prevents execution of subsequent instructions until preceding branches
      have been fully resolved and are no longer executing speculatively.
      There is no further documentation available at this time.
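
      Stripped of the runtime-patching machinery, the barrier itself is
      roughly the following sketch (the real version is patched in and out
      via feature sections):

        static inline void barrier_nospec(void)
        {
        	/* "or-immediate 0 into r31" no-op, repurposed as a speculation barrier */
        	asm volatile("ori 31,31,0" ::: "memory");
        }
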
      Signed-off-by: Michal Suchanek <msuchanek@suse.de>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a6b3964a
    • powerpc/64s: Wire up arch_trigger_cpumask_backtrace() · 5cc05910
      By Michael Ellerman
      This allows e.g. the RCU stall detector, or the soft/hardlockup
      detectors, to trigger a backtrace on all CPUs.
      
      We implement this by sending a "safe" NMI, which will actually only
      send an IPI. Unfortunately the generic code prints "NMI", so that's a
      little confusing but we can probably live with it.
      
      If one of the CPUs doesn't respond to the IPI, we then print some info
      from its paca and do a backtrace based on its saved_r1.
      
      Example output:
      
        INFO: rcu_sched detected stalls on CPUs/tasks:
        	2-...0: (0 ticks this GP) idle=1be/1/4611686018427387904 softirq=1055/1055 fqs=25735
        	(detected by 4, t=58847 jiffies, g=58, c=57, q=1258)
        Sending NMI from CPU 4 to CPUs 2:
        CPU 2 didn't respond to backtrace IPI, inspecting paca.
        irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 3623 (bash)
        Back trace of paca->saved_r1 (0xc0000000e1c83ba0) (possibly stale):
        Call Trace:
        [c0000000e1c83ba0] [0000000000000014] 0x14 (unreliable)
        [c0000000e1c83bc0] [c000000000765798] lkdtm_do_action+0x48/0x80
        [c0000000e1c83bf0] [c000000000765a40] direct_entry+0x110/0x1b0
        [c0000000e1c83c90] [c00000000058e650] full_proxy_write+0x90/0xe0
        [c0000000e1c83ce0] [c0000000003aae3c] __vfs_write+0x6c/0x1f0
        [c0000000e1c83d80] [c0000000003ab214] vfs_write+0xd4/0x240
        [c0000000e1c83dd0] [c0000000003ab5cc] ksys_write+0x6c/0x110
        [c0000000e1c83e30] [c00000000000b860] system_call+0x58/0x6c
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      5cc05910
    • powerpc/nmi: Add an API for sending "safe" NMIs · 6ba55716
      By Michael Ellerman
      Currently the options we have for sending NMIs are not necessarily
      safe, that is they can potentially interrupt a CPU in a
      non-recoverable region of code, meaning the kernel must then panic().
      
      But we'd like to use smp_send_nmi_ipi() to do cross-CPU calls in
      situations where we don't want to risk a panic(), because it doesn't
      have the requirement that interrupts must be enabled like
      smp_call_function().
      
      So add an API for the caller to indicate that it wants to use the NMI
      infrastructure, but doesn't want to do anything "unsafe".
      
      Currently that is implemented by not actually calling cause_nmi_ipi(),
      instead falling back to an IPI. In future we can pass the safe
      parameter down to cause_nmi_ipi() and the individual backends can
      potentially take it into account before deciding what to do.
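
      The assumed shape of the API described above (a sketch; the exact
      prototype in arch/powerpc/include/asm/smp.h may differ):

        /* Like smp_send_nmi_ipi(), but never does anything "unsafe";
         * currently it always falls back to an ordinary IPI. */
        int smp_send_safe_nmi_ipi(int cpu, void (*fn)(struct pt_regs *),
        			  u64 delay_us);
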
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      6ba55716
    • powerpc/64: Save stack pointer when we hard disable interrupts · 7b08729c
      By Michael Ellerman
      A CPU that gets stuck with interrupts hard disable can be difficult to
      debug, as on some platforms we have no way to interrupt the CPU to
      find out what it's doing.
      
      A stop-gap is to have the CPU save its stack pointer (r1) in its paca
      when it hard disables interrupts. That way if we can't interrupt it,
      we can at least trace the stack based on where it last disabled
      interrupts.

      In some cases that will be total junk, but the stack trace code should
      handle that. In the simple case of a CPU that disables interrupts and
      then gets stuck in a loop, the stack trace should be informative.
      
      We could clear the saved stack pointer when we enable interrupts, but
      that loses information which could be useful if we have nothing else
      to go on.
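
      A minimal sketch of the idea, assuming the paca already carries a
      saved_r1 field (the real change is in the hard-disable path in
      arch/powerpc/include/asm/hw_irq.h):

        #define hard_irq_disable()	do {				\
        	/* ... existing code that clears MSR[EE] ... */		\
        	local_paca->saved_r1 = current_stack_pointer();		\
        	/* ... */						\
        } while (0)
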
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      7b08729c
    • powerpc: Check address limit on user-mode return (TIF_FSCHECK) · 3e378680
      By Michael Ellerman
      set_fs() sets the addr_limit, which is used in access_ok() to
      determine if an address is a user or kernel address.
      
      Some code paths use set_fs() to temporarily elevate the addr_limit so
      that kernel code can read/write kernel memory as if it were user
      memory. That is fine as long as the code can't ever return to
      userspace with the addr_limit still elevated.
      
      If that did happen, then userspace can read/write kernel memory as if
      it were user memory, e.g. just with write(2). In case it's not clear,
      that is very bad. It has also happened in the past due to bugs.
      
      Commit 5ea0727b ("x86/syscalls: Check address limit on user-mode
      return") added a mechanism to check the addr_limit value before
      returning to userspace. Any call to set_fs() sets a thread flag,
      TIF_FSCHECK, and if we see that on the return to userspace we go out
      of line to check that the addr_limit value is not elevated.
      
      For further info see the above commit, as well as:
        https://lwn.net/Articles/722267/
        https://bugs.chromium.org/p/project-zero/issues/detail?id=990
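
      A rough sketch of the mechanism, based on the generic helpers named in
      the x86 commit (the powerpc details may differ):

        #define set_fs(fs) do {					\
        	current->thread.addr_limit = (fs);		\
        	/* flag the return path to verify addr_limit */	\
        	set_thread_flag(TIF_FSCHECK);			\
        } while (0)

        /* on the way back to userspace, e.g. in do_notify_resume(): */
        if (test_thread_flag(TIF_FSCHECK))
        	addr_limit_user_check();	/* WARNs if addr_limit is still elevated */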
      
      Verified to work on 64-bit Book3S using a POC that objdumps the system
      call handler, and a modified lkdtm_CORRUPT_USER_DS() that doesn't kill
      the caller.
      
      Before:
        $ sudo ./test-tif-fscheck
        ...
        0000000000000000 <.data>:
               0:       e1 f7 8a 79     rldicl. r10,r12,30,63
               4:       80 03 82 40     bne     0x384
               8:       00 40 8a 71     andi.   r10,r12,16384
               c:       78 0b 2a 7c     mr      r10,r1
              10:       10 fd 21 38     addi    r1,r1,-752
              14:       08 00 c2 41     beq-    0x1c
              18:       58 09 2d e8     ld      r1,2392(r13)
              1c:       00 00 41 f9     std     r10,0(r1)
              20:       70 01 61 f9     std     r11,368(r1)
              24:       78 01 81 f9     std     r12,376(r1)
              28:       70 00 01 f8     std     r0,112(r1)
              2c:       78 00 41 f9     std     r10,120(r1)
              30:       20 00 82 41     beq     0x50
              34:       a6 42 4c 7d     mftb    r10
      
      After:
      
        $ sudo ./test-tif-fscheck
        Killed
      
      And in dmesg:
        Invalid address limit on user-mode return
        WARNING: CPU: 1 PID: 3689 at ../include/linux/syscalls.h:260 do_notify_resume+0x140/0x170
        ...
        NIP [c00000000001ee50] do_notify_resume+0x140/0x170
        LR [c00000000001ee4c] do_notify_resume+0x13c/0x170
        Call Trace:
          do_notify_resume+0x13c/0x170 (unreliable)
          ret_from_except_lite+0x70/0x74
      
      Performance overhead is essentially zero in the usual case, because
      the bit is checked as part of the existing _TIF_USER_WORK_MASK check.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      3e378680
    • powerpc: Rename thread_struct.fs to addr_limit · ba0635fc
      By Michael Ellerman
      It's called 'fs' for historical reasons, it's named after the x86 'FS'
      register. But we don't have to use that name for the member of
      thread_struct, and in fact arch/x86 doesn't even call it 'fs' anymore.
      
      So rename it to 'addr_limit', which better reflects what it's used
      for, and is also the name used on other arches.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      ba0635fc
    • powerpc/eeh: Introduce eeh_for_each_pe() · 309ed3a7
      By Sam Bobroff
      Add a for_each-style macro for iterating through PEs without the
      boilerplate required by a traversal function. eeh_pe_next() is now
      exported, as it is used directly in place.
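
      The assumed shape of the macro (a sketch; the real definition is in
      arch/powerpc/include/asm/eeh.h):

        #define eeh_for_each_pe(root, pe)	\
        	for (pe = root; pe; pe = eeh_pe_next(pe, root))
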
      Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      309ed3a7
    • powerpc/eeh: Strengthen types of eeh traversal functions · d6c4932f
      By Sam Bobroff
      The traversal functions eeh_pe_traverse() and eeh_pe_dev_traverse()
      both provide their first argument as void * but every single user casts
      it to the expected type.
      
      Change the type of the first parameter from void * to the appropriate
      type, and clean up all uses.
      Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      d6c4932f
    • powerpc/perf: Unregister thread-imc if core-imc not supported · 25af86b2
      By Anju T Sudhakar
      Since thread-imc internally uses the core-imc hardware infrastructure
      and depends on it, having thread-imc in the kernel in the absence of
      core-imc serves no purpose. Disable thread-imc if core-imc is not
      registered.
      Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
      Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      25af86b2
    • powerpc/xive: Remove (almost) unused macros · 8a792262
      By Russell Currey
      The GETFIELD and SETFIELD macros in xive-regs.h aren't used except for
      a single instance of GETFIELD, so replace that and remove them.
      
      These macros are also defined in vas.h, so either those should be
      eventually replaced or the macros moved into bitops.h.
      Signed-off-by: Russell Currey <ruscur@russell.cc>
      [mpe: Rewrite the assignment to 'he' to avoid ffs() etc.]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      8a792262
    • powerpc: remove unused to_tm() helper · 34efabe4
      By Arnd Bergmann
      to_tm() is now completely unused, the only reference being in the
      _dump_time() helper that is also unused. This removes both, leaving
      the rest of the powerpc RTC code y2038 safe, as far as the hardware
      supports.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      34efabe4
    • powerpc: use time64_t in read_persistent_clock · 5bfd6435
      By Arnd Bergmann
      Looking through the remaining users of the deprecated mktime()
      function, I found the powerpc rtc handlers, which use it in
      place of rtc_tm_to_time64().
      
      To clean this up, I'm changing over the read_persistent_clock()
      function to the read_persistent_clock64() variant, and change
      all the platform specific handlers along with it.
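
      The signature change amounts to the following (a sketch; both
      prototypes come from the generic timekeeping headers):

        /* before: 32-bit time_t on 32-bit platforms, overflows in 2038 */
        void read_persistent_clock(struct timespec *ts);
        /* after: 64-bit seconds, y2038-safe */
        void read_persistent_clock64(struct timespec64 *ts);
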
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      5bfd6435
    • powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask · 0cef77c7
      By Nicholas Piggin
      When a single-threaded process has a non-local mm_cpumask, try to use
      that opportunity to flush the TLBs out of the other CPUs in the cpumask.
      
      An IPI is used for clearing remote CPUs for a few reasons:
      - An IPI can end lazy TLB use of the mm, which is required to prevent
        TLB entries being created on the remote CPU. The alternative is to
        drop lazy TLB switching completely, which costs 7.5% in a context
        switch ping-pong test between a process and kernel idle thread.
      - An IPI can have remote CPUs flush the entire PID, but the local CPU
        can flush a specific VA. tlbie would require over-flushing of the
        local CPU (where the process is running).
      - A single threaded process that is migrated to a different CPU is
        likely to have a relatively small mm_cpumask, so IPI is reasonable.
      
      No other thread can concurrently switch to this mm, because it must
      have been given a reference to mm_users by the current thread before it
      can use_mm. mm_users can be asynchronously incremented (by
      mm_activate or mmget_not_zero), but those users must use remote mm
      access and can't use_mm or access user address space. Existing code
      makes this assumption already, for example sparc64 has reset
      mm_cpumask using this condition since the start of history, see
      arch/sparc/kernel/smp_64.c.
      
      This reduces tlbies for a kernel compile workload from 0.90M to 0.12M,
      tlbiels are increased significantly due to the PID flushing used for
      cleaning up remote CPUs, and increased local flushes (PID flushes take
      128 tlbiels vs 1 tlbie).
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      0cef77c7
    • powerpc/64s/radix: optimise pte_update · 85bcfaf6
      By Nicholas Piggin
      Implementing pte_update with pte_xchg (which uses cmpxchg) is
      inefficient. A single larx/stcx. works fine, no need for the less
      efficient cmpxchg sequence.
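
      Roughly the shape of the resulting update loop (a hand-written sketch,
      not the exact radix__pte_update(); operand order and naming will
      differ):

        static inline unsigned long pte_update_sketch(pte_t *ptep,
        		unsigned long clr, unsigned long set)
        {
        	unsigned long old, tmp;

        	__asm__ __volatile__(
        	"1:	ldarx	%0,0,%3		# pte_update\n"
        	"	andc	%1,%0,%4\n"
        	"	or	%1,%1,%6\n"
        	"	stdcx.	%1,0,%3\n"
        	"	bne-	1b"
        	: "=&r" (old), "=&r" (tmp), "=m" (*ptep)
        	: "r" (ptep), "r" (clr), "m" (*ptep), "r" (set)
        	: "cc");

        	return old;
        }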
      
      Then remove the memory barriers from the operation. There is a
      requirement for TLB flushing to load mm_cpumask after the store
      that reduces pte permissions, which is moved into the TLB flush
      code.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      85bcfaf6
    • powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags · f1cb8f9b
      By Nicholas Piggin
      The ISA suggests ptesync after setting a pte, to prevent a table walk
      initiated by a subsequent access from missing that store and causing a
      spurious fault. This is an architectural allowance that permits an
      implementation's page table walker to be incoherent with the store
      queue.
      
      However there is no correctness problem in taking a spurious fault in
      userspace -- the kernel copes with these at any time, so the updated
      pte will be found eventually. Spurious kernel faults on vmap memory
      must be avoided, so a ptesync is put into flush_cache_vmap.
      
      On POWER9 so far I have not found a measurable window where this can
      result in more minor faults, so as an optimisation, remove the costly
      ptesync from pte updates. If an implementation benefits from ptesync,
      it would be better to add it back in update_mmu_cache, so it's not
      done for things like fork(2).
      
      fork --fork --exec benchmark improved 5.2% (12400->13100).
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f1cb8f9b
    • powerpc/64s/radix: make ptep_get_and_clear_full non-atomic for the full case · f569bd94
      By Nicholas Piggin
      This matches other architectures: when we know there will be no
      further accesses to the address (e.g., for teardown), page table
      entries can be cleared non-atomically.
      
      The comments about NMMU are bogus: all MMU notifiers (including NMMU)
      are released at this point, with their TLBs flushed. An NMMU access at
      this point would be a bug.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f569bd94
    • powerpc/64s/radix: do not flush TLB on spurious fault · 6d8278c4
      By Nicholas Piggin
      In the case of a spurious fault (which can happen due to a race with
      another thread that changes the page table), the default Linux mm code
      calls flush_tlb_page for that address. This is not required because
      the pte will be re-fetched. Hash does not wire this up to a hardware
      TLB flush for this reason. This patch avoids the flush for radix.
      
      From Power ISA v3.0B, p.1090:
      
          Setting a Reference or Change Bit or Upgrading Access Authority
          (PTE Subject to Atomic Hardware Updates)
      
          If the only change being made to a valid PTE that is subject to
          atomic hardware updates is to set the Reference or Change bit to
          1 or to add access authorities, a simpler sequence suffices
          because the translation hardware will refetch the PTE if an access
          is attempted for which the only problems were reference and/or
          change bits needing to be set or insufficient access authority.
      
      The nest MMU on POWER9 does not re-fetch the PTE after such an access
      attempt before faulting, so address spaces with a coprocessor
      attached will continue to flush in these cases.
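
      In other words, the spurious-fault flush hook can be a no-op on radix
      unless a coprocessor is attached. A sketch of that shape (the real
      definition is in arch/powerpc/include/asm/book3s/64/tlbflush.h):

        /* The nest MMU does not refetch the PTE, so keep flushing for
         * address spaces with a coprocessor attached. */
        #define flush_tlb_fix_spurious_fault(vma, address) do {		\
        	if (atomic_read(&(vma)->vm_mm->context.copros) > 0)	\
        		flush_tlb_page(vma, address);			\
        } while (0)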
      
      This reduces tlbies for a kernel compile workload from 0.95M to 0.90M.
      
      fork --fork --exec benchmark improved 0.5% (12300->12400).
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      6d8278c4
    • powerpc/mm/radix: Change pte relax sequence to handle nest MMU hang · bd5050e3
      By Aneesh Kumar K.V
      When relaxing access (read -> read_write update), pte needs to be marked invalid
      to handle a nest MMU bug. We also need to do a tlb flush after the pte is
      marked invalid before updating the pte with new access bits.
      
      We also move the tlb flush into the platform specific
      __ptep_set_access_flags. This will help us get rid of unnecessary tlb
      flushes on BOOK3S 64 later; we don't do that in this patch. It also
      helps in avoiding multiple tlbies with a coprocessor attached.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      bd5050e3
    • powerpc/mm: Change function prototype · e4c1112c
      By Aneesh Kumar K.V
      In a later patch, we use the vma and psize to do the tlb flush. Do the
      prototype update in a separate patch to make the review easier.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e4c1112c
    • powerpc/mm/radix: Move function from radix.h to pgtable-radix.c · 044003b5
      By Aneesh Kumar K.V
      In a later patch we will update these functions, which requires moving
      them to pgtable-radix.c. Keeping the function in radix.h results in
      the compile errors below.
      
      ./arch/powerpc/include/asm/book3s/64/radix.h: In function ‘radix__ptep_set_access_flags’:
      ./arch/powerpc/include/asm/book3s/64/radix.h:196:28: error: dereferencing pointer to incomplete type ‘struct vm_area_struct’
        struct mm_struct *mm = vma->vm_mm;
                                  ^~
      ./arch/powerpc/include/asm/book3s/64/radix.h:204:6: error: implicit declaration of function ‘atomic_read’; did you mean ‘__atomic_load’? [-Werror=implicit-function-declaration]
            atomic_read(&mm->context.copros) > 0) {
            ^~~~~~~~~~~
            __atomic_load
      ./arch/powerpc/include/asm/book3s/64/radix.h:204:21: error: dereferencing pointer to incomplete type ‘struct mm_struct’
            atomic_read(&mm->context.copros) > 0) {
      
      Instead of fixing header dependencies, we move the function to
      pgtable-radix.c. Also the function is now too large to be a static
      inline. Doing the move in a separate patch helps review.
      
      No functional change in this patch. Only code movement.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      044003b5
    • powerpc/mm/hugetlb: Update huge_ptep_set_access_flags to call __ptep_set_access_flags directly · f069ff39
      By Aneesh Kumar K.V
      In a later patch, we want to update __ptep_set_access_flags to take a
      page size argument. This makes ptep_set_access_flags only work with
      mmu_virtual_psize. To simplify the code, make huge_ptep_set_access_flags
      call __ptep_set_access_flags directly, so that we can compute the
      hugetlb page size in the hugetlb function.

      Now that ptep_set_access_flags won't be called for hugetlb, remove
      the is_vm_hugetlb_page() check and add the assert of the pte lock
      unconditionally.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f069ff39
    • ocxl: Rename pnv_ocxl_spa_remove_pe to clarify it's action · 19df3958
      By Alastair D'Silva
      The function removes the process element from the NPU cache.
      Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
      Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
      Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      19df3958
    • powerpc: use task_pid_nr() for TID allocation · 71cc64a8
      By Alastair D'Silva
      The current implementation of TID allocation, using a global IDR, may
      result in an errant process starving the system of available TIDs.
      Instead, use task_pid_nr(), as mentioned by the original author. The
      scenario described that prevented its use is not applicable, as
      set_thread_tidr can only be called after the task struct has been
      populated.
      
      In the unlikely event that 2 threads share the TID and are waiting,
      all potential outcomes have been determined safe.
      Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
      Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
      Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      71cc64a8
    • powerpc: Add TIDR CPU feature for POWER9 · 81984428
      By Alastair D'Silva
      This patch adds a CPU feature bit to show whether the CPU has
      the TIDR register available, enabling as_notify/wait in userspace.
      Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
      Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
      Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      81984428
    • powerpc/powernv: call OPAL_QUIESCE before OPAL_SIGNAL_SYSTEM_RESET · ee03b9b4
      By Nicholas Piggin
      Although it is often possible to recover a CPU that was interrupted
      from OPAL with a system reset NMI, it's undesirable to interrupt them
      for a few reasons. Firstly because dump/debug code itself needs to
      call firmware, so it could hang on a lock or possibly corrupt a
      per-cpu data structure if it or another CPU was interrupted from
      OPAL. Secondly, the kexec crash dump code will not return from
      interrupt to unwind the OPAL call.
      
      Call OPAL_QUIESCE with QUIESCE_HOLD before sending an NMI IPI to
      another CPU, which waits for it to leave firmware (or times out), to
      avoid this problem in normal conditions. Firmware bugs may still
      result in a timeout and interrupting OPAL, but that is the best
      option (stops the CPU, and possibly allows firmware to be debugged).
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      ee03b9b4
    • powerpc/time: account broadcast timer event interrupts separately · e360cd37
      By Nicholas Piggin
      These are not local timer interrupts but IPIs. It's good to be able
      to see how timer offloading is behaving, so split these out into
      their own category.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e360cd37
    • powerpc: generic clockevents broadcast receiver call tick_receive_broadcast · 3f984620
      By Nicholas Piggin
      The broadcast tick recipient can call tick_receive_broadcast rather
      than re-running the full timer interrupt.
      
      It does not have to check for the next event time, because the sender
      already determined the timer has expired. It does not have to test
      irq_work_pending, because that's a direct decrementer interrupt and
      does not go through the clock events subsystem. And it does not have
      to read PURR because that was removed with the previous patch.
      
      This results in no code size change, but both the decrementer and
      broadcast path lengths are reduced.
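
      A sketch of the resulting receiver path (handler name as assumed here;
      the real one lives in arch/powerpc/kernel/time.c):

        void tick_broadcast_ipi_handler(void)
        {
        	/* The sender already determined the timer has expired,
        	 * so there is no need to re-check the next event time. */
        	tick_receive_broadcast();
        }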
      
      Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      3f984620
    • powerpc/pseries: lparcfg calculate PURR on demand · 3d3a6021
      By Nicholas Piggin
      For SPLPAR, lparcfg provides a sum of PURR registers for all CPUs.
      Currently this is done by reading PURR in context switch and timer
      interrupt, and storing that into a per-CPU variable. These are summed
      to provide the value.
      
      This does not work with all timer schemes (e.g., NO_HZ_FULL), and it
      is sub-optimal for performance because it reads the PURR register on
      every context switch, although that's been difficult to distinguish
      from noise in the context_switch microbenchmark.
      
      This patch implements the sum by calling a function on each CPU, to
      read and add PURR values of each CPU.
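
      A sketch of the on-demand summation described above (the real code is
      in arch/powerpc/platforms/pseries/lparcfg.c and may differ in detail):

        static void cpu_get_purr(void *arg)
        {
        	atomic64_t *sum = arg;

        	atomic64_add(mfspr(SPRN_PURR), sum);
        }

        static u64 get_purr_sum(void)
        {
        	atomic64_t purr = ATOMIC64_INIT(0);

        	/* read PURR on every online CPU and accumulate */
        	on_each_cpu(cpu_get_purr, &purr, 1);
        	return atomic64_read(&purr);
        }
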
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      3d3a6021
    • powerpc/64: remove start_tb and accum_tb from thread_struct · 36d632ea
      By Nicholas Piggin
      These fields are only written to.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      36d632ea
    • powerpc/64s: micro-optimise __hard_irq_enable() for mtmsrd L=1 support · 54071e41
      By Nicholas Piggin
      Book3S minimum supported ISA version now requires mtmsrd L=1. This
      instruction does not require bits other than RI and EE to be supplied,
      so __hard_irq_enable() and __hard_irq_disable() do not have to read
      the kernel_msr from the paca.
      
      Interrupt entry code already relies on L=1 support.
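
      A sketch of the resulting shape (the real macros are in
      arch/powerpc/include/asm/hw_irq.h):

        /* mtmsrd L=1 only updates EE and RI, so kernel_msr is not needed */
        #define __hard_irq_enable()	__mtmsrd(MSR_EE | MSR_RI, 1)
        #define __hard_irq_disable()	__mtmsrd(MSR_RI, 1)
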
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      54071e41