1. 10 Nov 2019, 1 commit
  2. 12 Oct 2019, 1 commit
  3. 08 Oct 2019, 1 commit
  4. 01 Oct 2019, 1 commit
  5. 19 Sep 2019, 1 commit
  6. 16 Sep 2019, 3 commits
  7. 31 Jul 2019, 1 commit
  8. 25 Jun 2019, 1 commit
  9. 22 Jun 2019, 1 commit
    • KVM: PPC: Book3S: Use new mutex to synchronize access to rtas token list · b376683f
      Paul Mackerras committed
      [ Upstream commit 1659e27d2bc1ef47b6d031abe01b467f18cb72d9 ]
      
      Currently the Book 3S KVM code uses kvm->lock to synchronize access
      to the kvm->arch.rtas_tokens list.  Because this list is scanned
      inside kvmppc_rtas_hcall(), which is called with the vcpu mutex held,
      taking kvm->lock causes a lock inversion problem, which could lead to
      a deadlock.
      
      To fix this, we add a new mutex, kvm->arch.rtas_token_lock, which nests
      inside the vcpu mutexes, and use that instead of kvm->lock when
      accessing the rtas token list.
      
      This removes the lockdep_assert_held() in kvmppc_rtas_tokens_free().
      At this point we don't hold the new mutex, but that is OK because
      kvmppc_rtas_tokens_free() is only called when the whole VM is being
      destroyed, and at that point nothing can be looking up a token in
      the list.
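
      As a rough sketch of the pattern described above (not the verbatim
      upstream diff; the update helper name is illustrative):

        /* rtas_token_lock protects only the rtas_tokens list and nests
         * inside the vcpu mutexes, unlike kvm->lock. */
        struct kvm_arch {
            /* ...existing fields... */
            struct list_head rtas_tokens;
            struct mutex rtas_token_lock;
        };

        static int rtas_token_define(struct kvm *kvm, char *name, u64 token)
        {
            int rc;

            /* previously kvm->lock, which inverted against the vcpu
             * mutex held by kvmppc_rtas_hcall() */
            mutex_lock(&kvm->arch.rtas_token_lock);
            rc = rtas_token_update(kvm, name, token); /* illustrative helper */
            mutex_unlock(&kvm->arch.rtas_token_lock);

            return rc;
        }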
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b376683f
  10. 04 Jun 2019, 1 commit
  11. 17 May 2019, 2 commits
    • powerpc/booke64: set RI in default MSR · 4179b858
      Laurentiu Tudor committed
      commit 5266e58d6cd90ac85c187d673093ad9cb649e16d upstream.
      
      Set RI in the default kernel's MSR so that the architected way of
      detecting unrecoverable machine check interrupts has a chance to work.
      This is in line with the MSR setup of the rest of the booke powerpc
      architectures configured here.
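
      As a hedged illustration of the change (macro name and contents are
      approximate, not the exact upstream hunk), this amounts to OR-ing
      MSR_RI into the default kernel MSR for book3e:

        /* Include MSR_RI (recoverable interrupt) in the default kernel MSR
         * so the architected detection of unrecoverable machine checks has
         * a chance to work; the exact macro differs per configuration. */
        #define MSR_KERNEL  (MSR_ME | MSR_CE | MSR_RI)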
      Signed-off-by: Laurentiu Tudor <laurentiu.tudor@nxp.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4179b858
    • powerpc/book3s/64: check for NULL pointer in pgd_alloc() · 69c2b71c
      Rick Lindsley committed
      commit f39356261c265a0689d7ee568132d516e8b6cecc upstream.
      
      When the memset code was added to pgd_alloc(), it failed to consider
      that kmem_cache_alloc() can return NULL. It's uncommon, but not
      impossible under heavy memory contention. Example oops:
      
        Unable to handle kernel paging request for data at address 0x00000000
        Faulting instruction address: 0xc0000000000a4000
        Oops: Kernel access of bad area, sig: 11 [#1]
        LE SMP NR_CPUS=2048 NUMA pSeries
        CPU: 70 PID: 48471 Comm: entrypoint.sh Kdump: loaded Not tainted 4.14.0-115.6.1.el7a.ppc64le #1
        task: c000000334a00000 task.stack: c000000331c00000
        NIP:  c0000000000a4000 LR: c00000000012f43c CTR: 0000000000000020
        REGS: c000000331c039c0 TRAP: 0300   Not tainted  (4.14.0-115.6.1.el7a.ppc64le)
        MSR:  800000010280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 44022840  XER: 20040000
        CFAR: c000000000008874 DAR: 0000000000000000 DSISR: 42000000 SOFTE: 1
        ...
        NIP [c0000000000a4000] memset+0x68/0x104
        LR [c00000000012f43c] mm_init+0x27c/0x2f0
        Call Trace:
          mm_init+0x260/0x2f0 (unreliable)
          copy_mm+0x11c/0x638
          copy_process.isra.28.part.29+0x6fc/0x1080
          _do_fork+0xdc/0x4c0
          ppc_clone+0x8/0xc
        Instruction dump:
        409e000c b0860000 38c60002 409d000c 90860000 38c60004 78a0d183 78a506a0
        7c0903a6 41820034 60000000 60420000 <f8860000> f8860008 f8860010 f8860018
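
      A minimal sketch of the fix (simplified pgd_alloc(); the in-tree
      version has additional details):

        static inline pgd_t *pgd_alloc(struct mm_struct *mm)
        {
            pgd_t *pgd;

            pgd = kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE),
                                   pgtable_gfp_flags(mm, GFP_KERNEL));
            /* kmem_cache_alloc() can return NULL under heavy memory
             * pressure; don't memset() a NULL pointer. */
            if (unlikely(!pgd))
                return pgd;

            memset(pgd, 0, PGD_TABLE_SIZE);
            return pgd;
        }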
      
      Fixes: fc5c2f4a ("powerpc/mm/hash64: Zero PGD pages on allocation")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Rick Lindsley <ricklind@vnet.linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      69c2b71c
  12. 06 Apr 2019, 1 commit
  13. 03 Apr 2019, 4 commits
  14. 27 Mar 2019, 1 commit
    • powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038 · b8ea151a
      Michael Ellerman committed
      commit b5b4453e7912f056da1ca7572574cada32ecb60c upstream.
      
      Jakub Drnec reported:
        Setting the realtime clock can sometimes make the monotonic clock go
        back by over a hundred years. Decreasing the realtime clock across
        the y2k38 threshold is one reliable way to reproduce. Allegedly this
        can also happen just by running ntpd, I have not managed to
        reproduce that other than booting with rtc at >2038 and then running
        ntp. When this happens, anything with timers (e.g. openjdk) breaks
        rather badly.
      
      And included a test case (slightly edited for brevity):
        #define _POSIX_C_SOURCE 199309L
        #include <stdio.h>
        #include <time.h>
        #include <stdlib.h>
        #include <unistd.h>
      
        long get_time(void) {
          struct timespec tp;
          clock_gettime(CLOCK_MONOTONIC, &tp);
          return tp.tv_sec + tp.tv_nsec / 1000000000;
        }
      
        int main(void) {
          long last = get_time();
          while(1) {
            long now = get_time();
            if (now < last) {
              printf("clock went backwards by %ld seconds!\n", last - now);
            }
            last = now;
            sleep(1);
          }
          return 0;
        }
      
      Which when run concurrently with:
       # date -s 2040-1-1
       # date -s 2037-1-1
      
      Will detect the clock going backward.
      
      The root cause is that wtom_clock_sec in struct vdso_data is only a
      32-bit signed value, even though we set its value to be equal to
      tk->wall_to_monotonic.tv_sec which is 64-bits.
      
      Because the monotonic clock starts at zero when the system boots, the
      wall_to_monotonic.tv_sec offset is negative for current and future
      dates. Currently on a freshly booted system the offset will be in the
      vicinity of negative 1.5 billion seconds.
      
      However if the wall clock is set past the Y2038 boundary, the offset
      from wall to monotonic becomes less than negative 2^31, and no longer
      fits in 32-bits. When that value is assigned to wtom_clock_sec it is
      truncated and becomes positive, causing the VDSO assembly code to
      calculate CLOCK_MONOTONIC incorrectly.
      
      That causes CLOCK_MONOTONIC to jump ahead by ~4 billion seconds which
      it is not meant to do. Worse, if the time is then set back before the
      Y2038 boundary CLOCK_MONOTONIC will jump backward.
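
      The truncation itself is easy to see in isolation with a few lines of
      userspace C (illustrative only, not kernel code; on typical
      two's-complement targets the stored value wraps to a large positive
      number):

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
            /* offset a bit past -2^31 seconds, as happens once the wall
             * clock is set beyond 2038 */
            int64_t wall_to_mono = -2200000000LL;
            int32_t wtom_clock_sec = (int32_t)wall_to_mono; /* like the old 32-bit field */

            printf("64-bit offset: %lld\n", (long long)wall_to_mono);
            printf("32-bit field : %d\n", wtom_clock_sec);
            return 0;
        }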
      
      We can fix it simply by storing the full 64-bit offset in the
      vdso_data, and using that in the VDSO assembly code. We also shuffle
      some of the fields in vdso_data to avoid creating a hole.
      
      The original commit that added the CLOCK_MONOTONIC support to the VDSO
      did actually use a 64-bit value for wtom_clock_sec, see commit
      a7f290da ("[PATCH] powerpc: Merge vdso's and add vdso support to
      32 bits kernel") (Nov 2005). However just 3 days later it was
      converted to 32-bits in commit 0c37ec2a ("[PATCH] powerpc: vdso
      fixes (take #2)"), and the bug has existed since then AFAICS.
      
      Fixes: 0c37ec2a ("[PATCH] powerpc: vdso fixes (take #2)")
      Cc: stable@vger.kernel.org # v2.6.15+
      Link: http://lkml.kernel.org/r/HaC.ZfES.62bwlnvAvMP.1STMMj@seznam.cz
      Reported-by: Jakub Drnec <jaydee@email.cz>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b8ea151a
  15. 24 Mar 2019, 3 commits
    • KVM: Call kvm_arch_memslots_updated() before updating memslots · 23ad135a
      Sean Christopherson committed
      commit 152482580a1b0accb60676063a1ac57b2d12daf6 upstream.
      
      kvm_arch_memslots_updated() is at this point in time an x86-specific
      hook for handling MMIO generation wraparound.  x86 stashes 19 bits of
      the memslots generation number in its MMIO sptes in order to avoid
      full page fault walks for repeat faults on emulated MMIO addresses.
      Because only 19 bits are used, wrapping the MMIO generation number is
      possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
      the generation has changed so that it can invalidate all MMIO sptes in
      case the effective MMIO generation has wrapped so as to avoid using a
      stale spte, e.g. a (very) old spte that was created with generation==0.
      
      Given that the purpose of kvm_arch_memslots_updated() is to prevent
      consuming stale entries, it needs to be called before the new generation
      is propagated to memslots.  Invalidating the MMIO sptes after updating
      memslots means that there is a window where a vCPU could dereference
      the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
      spte that was created with (pre-wrap) generation==0.
      
      Fixes: e59dbe09 ("KVM: Introduce kvm_arch_memslots_updated()")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      23ad135a
    • powerpc/hugetlb: Don't do runtime allocation of 16G pages in LPAR configuration · baed68a9
      Aneesh Kumar K.V committed
      commit 35f2806b481f5b9207f25e1886cba5d1c4d12cc7 upstream.
      
      We added runtime allocation of 16G pages in commit 4ae279c2
      ("powerpc/mm/hugetlb: Allow runtime allocation of 16G.") That was done
      to enable 16G allocation on PowerNV and KVM configs. In the case of a
      KVM config, we would mostly have the entire guest RAM backed by 16G
      hugetlb pages for this to work. PAPR does support partial backing of
      guest RAM with hugepages via the ibm,expected#pages node of the memory
      node in the device tree. This means the rest of the guest RAM won't be
      backed by 16G contiguous pages in the host, and hence a hash page
      table insertion can fail in such a case.
      
      An example error message will look like
      
        hash-mmu: mm: Hashing failure ! EA=0x7efc00000000 access=0x8000000000000006 current=readback
        hash-mmu:     trap=0x300 vsid=0x67af789 ssize=1 base psize=14 psize 14 pte=0xc000000400000386
        readback[12260]: unhandled signal 7 at 00007efc00000000 nip 00000000100012d0 lr 000000001000127c code 2
      
      This patch addresses that by preventing runtime allocation of 16G
      hugepages in an LPAR config. To allocate 16G hugetlb pages, one needs
      to use the kernel command line options hugepagesz=16G
      hugepages=<number of 16G pages>.
      
      With radix translation mode we don't run into this issue.
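
      The guard amounts to something like the following (a sketch of the
      idea with an illustrative function name, not the exact upstream hunk):

        /* Refuse runtime allocation of gigantic (16G) pages when running
         * in an LPAR with hash translation, since the host cannot be
         * assumed to back all of guest RAM with 16G pages. */
        static bool gigantic_page_runtime_alloc_supported(void)
        {
            if (firmware_has_feature(FW_FEATURE_LPAR) && !radix_enabled())
                return false;
            return true;
        }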
      
      This change will prevent runtime allocation of 16G hugetlb pages on
      KVM with hash translation mode. However, with the current upstream it
      was observed that a 16G hugetlbfs-backed guest doesn't boot at all.
      
      We observe boot failure with the below message:
        [131354.647546] KVM: map_vrma at 0 failed, ret=-4
      
      That means this patch is not resulting in an observable regression.
      Once we fix the boot issue with 16G hugetlb backed memory, we need to
      use ibm,expected#pages memory node attribute to indicate 16G page
      reservation to the guest. This will also enable partial backing of
      guest RAM with 16G pages.
      
      Fixes: 4ae279c2 ("powerpc/mm/hugetlb: Allow runtime allocation of 16G.")
      Cc: stable@vger.kernel.org # v4.14+
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      baed68a9
    • powerpc/powernv: Don't reprogram SLW image on every KVM guest entry/exit · 3bf8ff7b
      Paul Mackerras committed
      commit 19f8a5b5be2898573a5e1dc1db93e8d40117606a upstream.
      
      Commit 24be85a2 ("powerpc/powernv: Clear PECE1 in LPCR via stop-api
      only on Hotplug", 2017-07-21) added two calls to opal_slw_set_reg()
      inside pnv_cpu_offline(), with the aim of changing the LPCR value in
      the SLW image to disable wakeups from the decrementer while a CPU is
      offline.  However, pnv_cpu_offline() gets called each time a secondary
      CPU thread is woken up to participate in running a KVM guest, that is,
      not just when a CPU is offlined.
      
      Since opal_slw_set_reg() is a very slow operation (with observed
      execution times around 20 milliseconds), this means that an offline
      secondary CPU can often be busy doing the opal_slw_set_reg() call
      when the primary CPU wants to grab all the secondary threads so that
      it can run a KVM guest.  This leads to messages like "KVM: couldn't
      grab CPU n" being printed and guest execution failing.
      
      There is no need to reprogram the SLW image on every KVM guest entry
      and exit.  So that we do it only when a CPU is really transitioning
      between online and offline, this moves the calls to
      pnv_program_cpu_hotplug_lpcr() into pnv_smp_cpu_kill_self().
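
      Schematically (a simplified sketch, not the upstream diff), the slow
      stop-api update now happens only on a real offline:

        static void pnv_smp_cpu_kill_self(void)
        {
            /* Clear PECE1 in the SLW image once, on a genuine hotplug
             * offline, rather than in pnv_cpu_offline(), which also runs
             * every time a secondary thread naps between KVM guest runs. */
            u64 lpcr_val = mfspr(SPRN_LPCR) & ~(u64)LPCR_PECE1;

            pnv_program_cpu_hotplug_lpcr(smp_processor_id(), lpcr_val);

            /* ...then nap/stop in a loop until the CPU is restarted... */
        }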
      
      Fixes: 24be85a2 ("powerpc/powernv: Clear PECE1 in LPCR via stop-api only on Hotplug")
      Cc: stable@vger.kernel.org # v4.14+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3bf8ff7b
  16. 15 Feb 2019, 1 commit
    • powerpc/radix: Fix kernel crash with mremap() · 49c473e1
      Aneesh Kumar K.V committed
      commit 579b9239c1f38665b21e8d0e6ee83ecc96dbd6bb upstream.
      
      With support for split pmd lock, we use the pmd page's pmd_huge_pte
      pointer to store the deposited page table. In those configs, when we
      move page tables we need to make sure we move the deposited page table
      to the correct pmd page. Otherwise this can result in a crash when we
      withdraw the deposited page table, because we can find pmd_huge_pte
      NULL.
      
      eg:
      
        __split_huge_pmd+0x1070/0x1940
        __split_huge_pmd+0xe34/0x1940 (unreliable)
        vma_adjust_trans_huge+0x110/0x1c0
        __vma_adjust+0x2b4/0x9b0
        __split_vma+0x1b8/0x280
        __do_munmap+0x13c/0x550
        sys_mremap+0x220/0x7e0
        system_call+0x5c/0x70
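
      A hedged sketch of the idea (the helper name is illustrative; the real
      code also deals with locking and the hash/radix differences):

        /* When moving a huge pmd, move the deposited page table along with
         * it, so a later withdraw on the destination pmd page does not find
         * pmd_huge_pte == NULL. */
        static void move_deposited_pgtable(struct mm_struct *mm,
                                           pmd_t *old_pmd, pmd_t *new_pmd)
        {
            pgtable_t pgtable;

            pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
            pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
        }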
      
      Fixes: 675d9952 ("powerpc/book3s64: Enable split pmd ptlock.")
      Cc: stable@vger.kernel.org # v4.18+
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      49c473e1
  17. 13 Feb 2019, 2 commits
    • powerpc/fadump: Do not allow hot-remove memory from fadump reserved area. · fc090081
      Mahesh Salgaonkar committed
      [ Upstream commit 0db6896ff6332ba694f1e61b93ae3b2640317633 ]
      
      For fadump to work successfully there should not be any holes in the
      reserved memory ranges where the kernel has asked firmware to move the
      contents of the old kernel's memory in the event of a crash. Now that
      fadump uses CMA for the reserved area, this memory area is no longer
      protected from hot-remove operations unless it is CMA-allocated.
      Hence, the fadump service can fail to re-register after a hot-remove
      operation if the hot-removed memory belongs to the fadump reserved
      region. To avoid this, make sure that memory from the fadump reserved
      area is not hot-removable if fadump is registered.
      
      However, if the user still wants to remove that memory, they can do so
      by manually stopping the fadump service before the hot-remove
      operation.
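
      Conceptually the check is along these lines (hypothetical helper and
      field names, shown only to illustrate the overlap test):

        /* A memory block is not hot-removable if fadump is registered and
         * the block overlaps the fadump reserved area. Names illustrative. */
        static bool block_outside_fadump_area(u64 start, u64 size)
        {
            u64 res_start = fw_dump.reserve_dump_area_start;
            u64 res_end   = res_start + fw_dump.reserve_dump_area_size;

            if (!fw_dump.dump_registered)
                return true;

            return (start + size <= res_start) || (start >= res_end);
        }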
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      fc090081
    • powerpc/uaccess: fix warning/error with access_ok() · e1219431
      Christophe Leroy committed
      [ Upstream commit 05a4ab823983d9136a460b7b5e0d49ee709a6f86 ]
      
      With the following piece of code, the compilation warning below is
      encountered:
      
      	if (_IOC_DIR(ioc) != _IOC_NONE) {
      		int verify = _IOC_DIR(ioc) & _IOC_READ ? VERIFY_WRITE : VERIFY_READ;
      
      		if (!access_ok(verify, ioarg, _IOC_SIZE(ioc))) {
      
      drivers/platform/test/dev.c: In function 'my_ioctl':
      drivers/platform/test/dev.c:219:7: warning: unused variable 'verify' [-Wunused-variable]
         int verify = _IOC_DIR(ioc) & _IOC_READ ? VERIFY_WRITE : VERIFY_READ;
      
      This patch fixes it by referencing 'type' in the macro, although doing
      nothing with it.
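
      The fix is essentially to evaluate 'type' and cast it to void inside
      the macro, roughly like this (a hedged sketch of the powerpc
      definition at the time):

        /* (void)(type) marks the argument as used, silencing
         * -Wunused-variable in callers that only compute it for
         * access_ok(), while generating no code. */
        #define access_ok(type, addr, size)		\
        	(__chk_user_ptr(addr), (void)(type),	\
        	 __access_ok((__force unsigned long)(addr), (size), get_fs()))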
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e1219431
  18. 01 Dec 2018, 1 commit
    • powerpc/io: Fix the IO workarounds code to work with Radix · b7718632
      Michael Ellerman committed
      [ Upstream commit 43c6494fa1499912c8177e71450c0279041152a6 ]
      
      Back in 2006 Ben added some workarounds for a misbehaviour in the
      Spider IO bridge used on early Cell machines, see commit
      014da7ff ("[POWERPC] Cell "Spider" MMIO workarounds"). Later these
      were made to be generic, ie. not tied specifically to Spider.
      
      The code stashes a token in the high bits (59-48) of virtual addresses
      used for IO (eg. returned from ioremap()). This works fine when using
      the Hash MMU, but when we're using the Radix MMU the bits used for the
      token overlap with some of the bits of the virtual address.
      
      This is because the maximum virtual address is larger with Radix, up
      to c00fffffffffffff, and in fact we use that high part of the address
      range for ioremap(), see RADIX_KERN_IO_START.
      
      As it happens the bits that are used overlap with the bits that
      differentiate an IO address vs a linear map address. If the resulting
      address lies outside the linear mapping we will crash (see below), if
      not we just corrupt memory.
      
        virtio-pci 0000:00:00.0: Using 64-bit direct DMA at offset 800000000000000
        Unable to handle kernel paging request for data at address 0xc000000080000014
        ...
        CFAR: c000000000626b98 DAR: c000000080000014 DSISR: 42000000 IRQMASK: 0
        GPR00: c0000000006c54fc c00000003e523378 c0000000016de600 0000000000000000
        GPR04: c00c000080000014 0000000000000007 0fffffff000affff 0000000000000030
               ^^^^
        ...
        NIP [c000000000626c5c] .iowrite8+0xec/0x100
        LR [c0000000006c992c] .vp_reset+0x2c/0x90
        Call Trace:
          .pci_bus_read_config_dword+0xc4/0x120 (unreliable)
          .register_virtio_device+0x13c/0x1c0
          .virtio_pci_probe+0x148/0x1f0
          .local_pci_probe+0x68/0x140
          .pci_device_probe+0x164/0x220
          .really_probe+0x274/0x3b0
          .driver_probe_device+0x80/0x170
          .__driver_attach+0x14c/0x150
          .bus_for_each_dev+0xb8/0x130
          .driver_attach+0x34/0x50
          .bus_add_driver+0x178/0x2f0
          .driver_register+0x90/0x1a0
          .__pci_register_driver+0x6c/0x90
          .virtio_pci_driver_init+0x2c/0x40
          .do_one_initcall+0x64/0x280
          .kernel_init_freeable+0x36c/0x474
          .kernel_init+0x24/0x160
          .ret_from_kernel_thread+0x58/0x7c
      
      This hasn't been a problem because CONFIG_PPC_IO_WORKAROUNDS which
      enables this code is usually not enabled. It is only enabled when it's
      selected by PPC_CELL_NATIVE which is only selected by
      PPC_IBM_CELL_BLADE and that in turn depends on BIG_ENDIAN. So in order
      to hit the bug you need to build a big endian kernel, with IBM Cell
      Blade support enabled, as well as Radix MMU support, and then boot
      that on Power9 using Radix MMU.
      
      Still we can fix the bug, so let's do that. We simply use fewer bits
      for the token; taking the union of the restrictions on the address
      from both Hash and Radix, we end up with 8 bits we can use for the
      token. The only user of the token is iowa_mem_find_bus(), which only
      supports 8 token values, so 8 bits is plenty for that.
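
      A sketch of what the narrower token field looks like (names, shift and
      mask values are illustrative, not the exact upstream constants):

        /* Stash a small bus token in a few high bits of the ioremap'd
         * address; with only 8 bits used, the field can be placed so it
         * avoids the bits Radix needs (placement here is illustrative). */
        #define IO_TOKEN_SHIFT	52
        #define IO_TOKEN_BITS	8
        #define IO_TOKEN_MASK	(((1UL << IO_TOKEN_BITS) - 1) << IO_TOKEN_SHIFT)

        static inline void __iomem *io_token_set(void __iomem *addr, unsigned long token)
        {
            return (void __iomem *)((unsigned long)addr | (token << IO_TOKEN_SHIFT));
        }

        static inline unsigned long io_token_get(const volatile void __iomem *addr)
        {
            return ((unsigned long)addr & IO_TOKEN_MASK) >> IO_TOKEN_SHIFT;
        }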
      
      Fixes: 566ca99a ("powerpc/mm/radix: Add dummy radix_enabled()")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b7718632
  19. 21 Nov 2018, 1 commit
  20. 14 Nov 2018, 1 commit
  21. 10 Oct 2018, 1 commit
    • mm: Preserve _PAGE_DEVMAP across mprotect() calls · 4628a645
      Jan Kara committed
      Currently _PAGE_DEVMAP bit is not preserved in mprotect(2) calls. As a
      result we will see warnings such as:
      
      BUG: Bad page map in process JobWrk0013  pte:800001803875ea25 pmd:7624381067
      addr:00007f0930720000 vm_flags:280000f9 anon_vma:          (null) mapping:ffff97f2384056f0 index:0
      file:457-000000fe00000030-00000009-000000ca-00000001_2001.fileblock fault:xfs_filemap_fault [xfs] mmap:xfs_file_mmap [xfs] readpage:          (null)
      CPU: 3 PID: 15848 Comm: JobWrk0013 Tainted: G        W          4.12.14-2.g7573215-default #1 SLE12-SP4 (unreleased)
      Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.01.00.0833.051120182255 05/11/2018
      Call Trace:
       dump_stack+0x5a/0x75
       print_bad_pte+0x217/0x2c0
       ? enqueue_task_fair+0x76/0x9f0
       _vm_normal_page+0xe5/0x100
       zap_pte_range+0x148/0x740
       unmap_page_range+0x39a/0x4b0
       unmap_vmas+0x42/0x90
       unmap_region+0x99/0xf0
       ? vma_gap_callbacks_rotate+0x1a/0x20
       do_munmap+0x255/0x3a0
       vm_munmap+0x54/0x80
       SyS_munmap+0x1d/0x30
       do_syscall_64+0x74/0x150
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      ...
      
      when mprotect(2) gets used on DAX mappings. Also there is a wide variety
      of other failures that can result from the missing _PAGE_DEVMAP flag
      when the area gets used by get_user_pages() later.
      
      Fix the problem by including _PAGE_DEVMAP in a set of flags that get
      preserved by mprotect(2).
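
      The fix boils down to adding _PAGE_DEVMAP to the mask of pte bits that
      protection changes preserve, roughly (sketch; the exact mask differs
      between x86 and powerpc):

        /* _PAGE_CHG_MASK: pte bits preserved across protection changes.
         * _PAGE_DEVMAP must be part of it so DAX/devmap ptes survive
         * mprotect(); other bits elided for brevity. */
        #define _PAGE_CHG_MASK	(PTE_RPN_MASK | _PAGE_DIRTY | _PAGE_ACCESSED | \
        			 _PAGE_SPECIAL | _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)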
      
      Fixes: 69660fd7 ("x86, mm: introduce _PAGE_DEVMAP")
      Fixes: ebd31197 ("powerpc/mm: Add devmap support for ppc64")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      4628a645
  22. 18 Sep 2018, 1 commit
    • powerpc: Avoid code patching freed init sections · 51c3c62b
      Michael Neuling committed
      This stops us from doing code patching in init sections after they've
      been freed.
      
      In this chain:
        kvm_guest_init() ->
          kvm_use_magic_page() ->
            fault_in_pages_readable() ->
      	 __get_user() ->
      	   __get_user_nocheck() ->
      	     barrier_nospec();
      
      We have a code patching location at barrier_nospec() and
      kvm_guest_init() is an init function. This whole chain gets inlined,
      so when we free the init section (hence kvm_guest_init()), this code
      goes away and hence should no longer be patched.
      
      We've seen this as userspace memory corruption when using a memory
      checker while doing partition migration testing on powervm (this
      starts the code patching post migration via
      /sys/kernel/mobility/migration). In theory, it could also happen when
      using /sys/kernel/debug/powerpc/barrier_nospec.
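
      A hedged sketch of the guard (close to, but not necessarily verbatim,
      the upstream change):

        static bool init_mem_is_free;

        int patch_instruction(unsigned int *addr, unsigned int instr)
        {
            /* Make sure we aren't patching a freed init section */
            if (init_mem_is_free && init_section_contains(addr, 4))
                return 0;

            /* ...otherwise continue with the normal patching path... */
            return do_patch_instruction(addr, instr);
        }

        void free_initmem(void)
        {
            init_mem_is_free = true;
            free_initmem_default(POISON_FREE_INITMEM);
        }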
      
      Cc: stable@vger.kernel.org # 4.13+
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      51c3c62b
  23. 12 Sep 2018, 1 commit
    • KVM: PPC: Avoid marking DMA-mapped pages dirty in real mode · 425333bf
      Alexey Kardashevskiy committed
      At the moment the real mode handler of H_PUT_TCE calls iommu_tce_xchg_rm()
      which in turn reads the old TCE and if it was a valid entry, marks
      the physical page dirty if it was mapped for writing. Since it is in
      real mode, realmode_pfn_to_page() is used instead of pfn_to_page()
      to get the page struct. However, SetPageDirty() itself reads the
      compound page head and returns a virtual address for the head page
      struct, and setting the dirty bit on that kills the system.
      
      This adds additional dirty bit tracking into the MM/IOMMU API for use
      in the real mode. Note that this does not change how VFIO and
      KVM (in virtual mode) set this bit. The KVM (real mode) changes include:
      - use the lowest bit of the cached host phys address to carry
      the dirty bit;
      - mark pages dirty when they are unpinned which happens when
      the preregistered memory is released which always happens in virtual
      mode;
      - add mm_iommu_ua_mark_dirty_rm() helper to set delayed dirty bit;
      - change iommu_tce_xchg_rm() to take the kvm struct for the mm to use
      in the new mm_iommu_ua_mark_dirty_rm() helper;
      - move iommu_tce_xchg_rm() to book3s_64_vio_hv.c (which is the only
      caller anyway) to reduce the real mode KVM and IOMMU knowledge
      across different subsystems.
      
      This removes realmode_pfn_to_page() as it is not used anymore.
      
      While we are at it, remove some EXPORT_SYMBOL_GPL()s, as that code is
      for real mode only and modules cannot call it anyway.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      425333bf
  24. 23 Aug 2018, 2 commits
    • powerpc/mm/books3s: Add new pte bit to mark pte temporarily invalid. · bd0dbb73
      Aneesh Kumar K.V committed
      When splitting a huge pmd pte, we need to mark the pmd entry invalid.
      We can do that by clearing the _PAGE_PRESENT bit. But then that will
      be taken as a swap pte. In order to differentiate between the two, use
      a software pte bit when invalidating.
      
      For a regular pte, due to bd5050e3 ("powerpc/mm/radix: Change pte
      relax sequence to handle nest MMU hang") we need to mark the pte entry
      invalid when relaxing access permission. Instead of marking it
      pte_none, which can result in different page table walk routines
      possibly skipping this pte entry, invalidate it but still keep it
      marked present.
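
      Sketched in code (the software bit chosen and the helpers are
      illustrative of the idea, not the exact upstream hunk):

        /* A software pte bit meaning "present but temporarily invalid", so
         * clearing the valid semantics for the split or relax sequence is
         * not mistaken for a swap entry or an empty pte. */
        #define _PAGE_INVALID	_RPAGE_SW0	/* illustrative bit choice */

        static inline int pte_present(pte_t pte)
        {
            /* still "present" while temporarily invalidated */
            return !!(pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_INVALID));
        }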
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      bd0dbb73
    • powerpc/nohash: fix pte_access_permitted() · 810e9f86
      Christophe Leroy committed
      Commit 5769beaf ("powerpc/mm: Add proper pte access check helper
      for other platforms") replaced generic pte_access_permitted() by an
      arch specific one.
      
      The generic one is defined as
      (pte_present(pte) && (!(write) || pte_write(pte)))
      
      The arch specific one is open coded, checking that the _PAGE_USER and
      _PAGE_WRITE (_PAGE_RW) flags are set, but failing to check that
      _PAGE_RO and _PAGE_PRIVILEGED are unset, leading to a useless test on
      targets like the 8xx, which defines _PAGE_RW and _PAGE_USER as 0.
      
      Commit 5fa5b16b ("powerpc/mm/hugetlb: Use pte_access_permitted
      for hugetlb access check") replaced some tests performed with
      pte helpers by a call to pte_access_permitted(), leading to the same
      issue.
      
      This patch rewrites powerpc/nohash pte_access_permitted()
      using pte helpers.
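
      The rewritten helper looks roughly like this (a sketch built from the
      generic pte helpers; the in-tree version may differ in detail):

        static inline bool pte_access_permitted(pte_t pte, bool write)
        {
            /* A read is permitted on a present, user-accessible, readable
             * pte; a write additionally requires pte_write(). Using the
             * helpers keeps platforms like the 8xx, which encode
             * permissions via _PAGE_RO/_PAGE_PRIVILEGED, correct. */
            if (!pte_present(pte) || !pte_user(pte) || !pte_read(pte))
                return false;

            if (write && !pte_write(pte))
                return false;

            return true;
        }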
      
      Fixes: 5769beaf ("powerpc/mm: Add proper pte access check helper for other platforms")
      Fixes: 5fa5b16b ("powerpc/mm/hugetlb: Use pte_access_permitted for hugetlb access check")
      Cc: stable@vger.kernel.org # v4.15+
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      810e9f86
  25. 21 Aug 2018, 1 commit
    • powerpc/topology: Get topology for shared processors at boot · 2ea62630
      Srikar Dronamraju committed
      On a shared LPAR, Phyp will not update the CPU associativity at boot
      time. Just after boot, the system does recognize itself as a shared
      LPAR and triggers a request for the correct CPU associativity. But by
      then the scheduler would have already created/destroyed its sched
      domains.
      
      This causes
        - Broken load balancing across nodes, causing islands of cores.
        - Performance degradation, especially if the system is lightly loaded.
        - dmesg to wrongly report all CPUs to be in Node 0.
        - Messages in dmesg saying "borken" topology.
        - With commit 051f3ca0 ("sched/topology: Introduce NUMA identity
          node sched domain"), rcu stalls at boot up.
      
      The sched_domains_numa_masks table which is used to generate cpumasks
      is only created at boot time just before creating sched domains and
      never updated. Hence, it's better to get the topology correct before
      the sched domains are created.
      
      For example, on a 64-core Power 8 shared LPAR, dmesg reports
      
        Brought up 512 CPUs
        Node 0 CPUs: 0-511
        Node 1 CPUs:
        Node 2 CPUs:
        Node 3 CPUs:
        Node 4 CPUs:
        Node 5 CPUs:
        Node 6 CPUs:
        Node 7 CPUs:
        Node 8 CPUs:
        Node 9 CPUs:
        Node 10 CPUs:
        Node 11 CPUs:
        ...
        BUG: arch topology borken
             the DIE domain not a subset of the NUMA domain
        BUG: arch topology borken
             the DIE domain not a subset of the NUMA domain
      
      numactl/lscpu output will still be correct with cores spreading across
      all nodes:
      
        Socket(s):             64
        NUMA node(s):          12
        Model:                 2.0 (pvr 004d 0200)
        Model name:            POWER8 (architected), altivec supported
        Hypervisor vendor:     pHyp
        Virtualization type:   para
        L1d cache:             64K
        L1i cache:             32K
        NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
        NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
        NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
        NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
        NUMA node4 CPU(s):     208-215,304-311,400-407,496-503
        NUMA node5 CPU(s):     168-175,264-271,360-367,456-463
        NUMA node6 CPU(s):     128-135,224-231,320-327,416-423
        NUMA node7 CPU(s):     136-143,232-239,328-335,424-431
        NUMA node8 CPU(s):     216-223,312-319,408-415,504-511
        NUMA node9 CPU(s):     144-151,240-247,336-343,432-439
        NUMA node10 CPU(s):    152-159,248-255,344-351,440-447
        NUMA node11 CPU(s):    160-167,256-263,352-359,448-455
      
      Currently on this LPAR, the scheduler detects 2 levels of NUMA and
      creates NUMA sched domains for all CPUs, but it finds a single DIE
      domain consisting of all CPUs. Hence it deletes all NUMA sched
      domains.
      
      To address this, detect the shared processor and update the topology
      soon after CPUs are set up, so that the correct topology is in place
      just before the scheduler creates sched domains.
      
      With the fix, dmesg reports:
      
        numa: Node 0 CPUs: 0-7 32-39 64-71 96-103 176-183 272-279 368-375 464-471
        numa: Node 1 CPUs: 8-15 40-47 72-79 104-111 184-191 280-287 376-383 472-479
        numa: Node 2 CPUs: 16-23 48-55 80-87 112-119 192-199 288-295 384-391 480-487
        numa: Node 3 CPUs: 24-31 56-63 88-95 120-127 200-207 296-303 392-399 488-495
        numa: Node 4 CPUs: 208-215 304-311 400-407 496-503
        numa: Node 5 CPUs: 168-175 264-271 360-367 456-463
        numa: Node 6 CPUs: 128-135 224-231 320-327 416-423
        numa: Node 7 CPUs: 136-143 232-239 328-335 424-431
        numa: Node 8 CPUs: 216-223 312-319 408-415 504-511
        numa: Node 9 CPUs: 144-151 240-247 336-343 432-439
        numa: Node 10 CPUs: 152-159 248-255 344-351 440-447
        numa: Node 11 CPUs: 160-167 256-263 352-359 448-455
      
      and lscpu also reports:
      
        Socket(s):             64
        NUMA node(s):          12
        Model:                 2.0 (pvr 004d 0200)
        Model name:            POWER8 (architected), altivec supported
        Hypervisor vendor:     pHyp
        Virtualization type:   para
        L1d cache:             64K
        L1i cache:             32K
        NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
        NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
        NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
        NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
        NUMA node4 CPU(s):     208-215,304-311,400-407,496-503
        NUMA node5 CPU(s):     168-175,264-271,360-367,456-463
        NUMA node6 CPU(s):     128-135,224-231,320-327,416-423
        NUMA node7 CPU(s):     136-143,232-239,328-335,424-431
        NUMA node8 CPU(s):     216-223,312-319,408-415,504-511
        NUMA node9 CPU(s):     144-151,240-247,336-343,432-439
        NUMA node10 CPU(s):    152-159,248-255,344-351,440-447
        NUMA node11 CPU(s):    160-167,256-263,352-359,448-455
      Reported-by: Manjunatha H R <manjuhr1@in.ibm.com>
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      [mpe: Trim / format change log]
      Tested-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      2ea62630
  26. 20 Aug 2018, 1 commit
  27. 18 Aug 2018, 1 commit
    • mm: convert return type of handle_mm_fault() caller to vm_fault_t · 50a7ca3c
      Souptick Joarder committed
      Use new return type vm_fault_t for fault handler.  For now, this is just
      documenting that the function returns a VM_FAULT value rather than an
      errno.  Once all instances are converted, vm_fault_t will become a
      distinct type.
      
      Ref-> commit 1c8f4220 ("mm: change return type to vm_fault_t")
      
      In this patch all the callers of handle_mm_fault() are changed to use
      the vm_fault_t type.
      
      Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PC
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: James E.J. Bottomley <jejb@parisc-linux.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50a7ca3c
  28. 13 Aug 2018, 1 commit
  29. 10 Aug 2018, 2 commits
    • powerpc/uaccess: Enable get_user(u64, *p) on 32-bit · f7a6947c
      Michael Ellerman committed
      Currently if you build a 32-bit powerpc kernel and use get_user() to
      load a u64 value, it will fail to build with, e.g.:
      
        kernel/rseq.o: In function `rseq_get_rseq_cs':
        kernel/rseq.c:123: undefined reference to `__get_user_bad'
      
      This is hitting the check in __get_user_size() that makes sure the
      size we're copying doesn't exceed the size of the destination:
      
        #define __get_user_size(x, ptr, size, retval)
        do {
        	retval = 0;
        	__chk_user_ptr(ptr);
        	if (size > sizeof(x))
        		(x) = __get_user_bad();
      
      Which doesn't immediately make sense because the size of the
      destination is u64, but it's not really, because __get_user_check()
      etc. internally create an unsigned long and copy into that:
      
        #define __get_user_check(x, ptr, size)
        ({
        	long __gu_err = -EFAULT;
        	unsigned long  __gu_val = 0;
      
      The problem is that on 32-bit, unsigned long is not big enough to hold
      a u64. We can fix this with a trick from hpa in the x86 code: we
      statically check the type of x and set the type of __gu_val to either
      unsigned long or unsigned long long.
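
      The trick can be sketched like this (based on the x86 approach; macro
      names and the surrounding details are illustrative):

        /* Pick unsigned long or unsigned long long for the temporary based
         * on the size of the destination, so a u64 destination gets a
         * 64-bit temporary even on 32-bit kernels. */
        #define __long_type(x) \
        	__typeof__(__builtin_choose_expr(sizeof(x) > sizeof(0UL), 0ULL, 0UL))

        #define __get_user_check(x, ptr, size)				\
        ({								\
        	long __gu_err = -EFAULT;				\
        	__long_type(*(ptr)) __gu_val = 0;			\
        	/* ...access_ok() and __get_user_size() as before... */	\
        	(x) = (__typeof__(*(ptr)))__gu_val;			\
        	__gu_err;						\
        })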
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f7a6947c
    • powerpc/mm/hash: Remove unnecessary do { } while(0) loop · f405b510
      Aneesh Kumar K.V committed
      Avoid Coverity false warnings like:
      
        *** CID 187347:  Control flow issues  (UNREACHABLE)
        /arch/powerpc/mm/hash_native_64.c: 819 in native_flush_hash_range()
        813        slot += hidx & _PTEIDX_GROUP_IX;
        814        hptep = htab_address + slot;
        815        want_v = hpte_encode_avpn(vpn, psize, ssize);
        816        hpte_v = hpte_get_old_v(hptep);
        817
        818        if (!HPTE_V_COMPARE(hpte_v, want_v) || !(hpte_v & HPTE_V_VALID))
        >>>     CID 187347:  Control flow issues  (UNREACHABLE)
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f405b510