1. 14 Dec 2017 (1 commit)
  2. 28 Nov 2017 (1 commit)
    • KVM: Let KVM_SET_SIGNAL_MASK work as advertised · 20b7035c
      Committed by Jan H. Schönherr
      KVM API says for the signal mask you set via KVM_SET_SIGNAL_MASK, that
      "any unblocked signal received [...] will cause KVM_RUN to return with
      -EINTR" and that "the signal will only be delivered if not blocked by
      the original signal mask".
      
      This, however, is only true when the calling task has a signal handler
      registered for the signal in question. If not, signal evaluation is
      short-circuited for SIG_IGN and SIG_DFL, and the signal is either ignored
      without KVM_RUN returning or the whole process is terminated.
      
      Make KVM_SET_SIGNAL_MASK behave as advertised by utilizing logic similar
      to that in do_sigtimedwait() to avoid short-circuiting of signals.
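      
      For context, a minimal userspace sketch of arming this mask on a vcpu
      fd (this follows the pattern QEMU uses; the vcpu fd comes from
      KVM_CREATE_VCPU, the 8-byte len assumes the 64-bit kernel sigset size,
      and error handling is omitted):
      
      #include <pthread.h>
      #include <signal.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/ioctl.h>
      #include <linux/kvm.h>
      
      /* Block SIGUSR1 for the thread, but let it interrupt KVM_RUN. */
      static void arm_kvm_sigmask(int vcpu_fd)
      {
              sigset_t blocked, run_mask;
              struct kvm_signal_mask *kmask;
      
              sigemptyset(&blocked);
              sigaddset(&blocked, SIGUSR1);
              pthread_sigmask(SIG_BLOCK, &blocked, NULL);
      
              /* Mask to use while inside KVM_RUN: everything but SIGUSR1. */
              sigfillset(&run_mask);
              sigdelset(&run_mask, SIGUSR1);
      
              kmask = malloc(sizeof(*kmask) + 8);
              kmask->len = 8;                      /* kernel sigset_t size */
              memcpy(kmask->sigset, &run_mask, 8); /* low 64 bits of the set */
              ioctl(vcpu_fd, KVM_SET_SIGNAL_MASK, kmask);
              free(kmask);
      }
      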
      Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      20b7035c
  3. 23 Nov 2017 (1 commit)
    • KVM: PPC: Book3S HV: Fix migration and HPT resizing of HPT guests on radix hosts · ded13fc1
      Committed by Paul Mackerras
      This fixes two errors that prevent a guest using the HPT MMU from
      successfully migrating to a POWER9 host in radix MMU mode, or resizing
      its HPT when running on a radix host.
      
      The first bug was that commit 8dc6cca5 ("KVM: PPC: Book3S HV:
      Don't rely on host's page size information", 2017-09-11) missed two
      uses of hpte_base_page_size(), one in the HPT rehashing code and
      one in kvm_htab_write() (which is used on the destination side in
      migrating a HPT guest).  Instead we use kvmppc_hpte_base_page_shift().
      Having the shift count means that we can use left and right shifts
      instead of multiplication and division in a few places.
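      
      As a tiny illustration of that point (the variable names here are
      purely illustrative):
      
      	unsigned int shift = kvmppc_hpte_base_page_shift(v, r);
      	unsigned long npages = region_bytes >> shift;  /* was: / page_size */
      	unsigned long offset = idx << shift;           /* was: * page_size */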
      
      Along the way, this adds a check in kvm_htab_write() to ensure that the
      page size encoding in the incoming HPTEs is recognized, and to return
      an EINVAL error to userspace if it is not.
      
      The second bug was that kvm_htab_write() was performing some but not all
      of the functions of kvmhv_setup_mmu(), resulting in the destination VM
      being left in radix mode as far as the hardware is concerned.  The
      simplest fix for now is to make kvm_htab_write() call
      kvmppc_setup_partition_table() like kvmppc_hv_setup_htab_rma() does.
      In future it would be better to refactor the code more extensively
      to remove the duplication.
      
      Fixes: 8dc6cca5 ("KVM: PPC: Book3S HV: Don't rely on host's page size information")
      Fixes: 7a84084c ("KVM: PPC: Book3S HV: Set partition table rather than SDR1 on POWER9")
      Reported-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Tested-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      ded13fc1
  4. 22 Nov 2017 (1 commit)
    • treewide: setup_timer() -> timer_setup() (2 field) · 86cb30ec
      Committed by Kees Cook
      This converts all remaining setup_timer() calls that use a nested field
      to reach a struct timer_list. Coccinelle does not have an easy way to
      match multiple fields, so a new script is needed to change the matches of
      "&_E->_timer" into "&_E->_field1._timer" in all the rules.
      
      spatch --very-quiet --all-includes --include-headers \
      	-I ./arch/x86/include -I ./arch/x86/include/generated \
      	-I ./include -I ./arch/x86/include/uapi \
      	-I ./arch/x86/include/generated/uapi -I ./include/uapi \
      	-I ./include/generated/uapi --include ./include/linux/kconfig.h \
      	--dir . \
      	--cocci-file ~/src/data/timer_setup-2fields.cocci
      
      @fix_address_of depends@
      expression e;
      @@
      
       setup_timer(
      -&(e)
      +&e
       , ...)
      
      // Update any raw setup_timer() usages that have a NULL callback, but
      // would otherwise match change_timer_function_usage, since the latter
      // will update all function assignments done in the face of a NULL
      // function initialization in setup_timer().
      @change_timer_function_usage_NULL@
      expression _E;
      identifier _field1;
      identifier _timer;
      type _cast_data;
      @@
      
      (
      -setup_timer(&_E->_field1._timer, NULL, _E);
      +timer_setup(&_E->_field1._timer, NULL, 0);
      |
      -setup_timer(&_E->_field1._timer, NULL, (_cast_data)_E);
      +timer_setup(&_E->_field1._timer, NULL, 0);
      |
      -setup_timer(&_E._field1._timer, NULL, &_E);
      +timer_setup(&_E._field1._timer, NULL, 0);
      |
      -setup_timer(&_E._field1._timer, NULL, (_cast_data)&_E);
      +timer_setup(&_E._field1._timer, NULL, 0);
      )
      
      @change_timer_function_usage@
      expression _E;
      identifier _field1;
      identifier _timer;
      struct timer_list _stl;
      identifier _callback;
      type _cast_func, _cast_data;
      @@
      
      (
      -setup_timer(&_E->_field1._timer, _callback, _E);
      +timer_setup(&_E->_field1._timer, _callback, 0);
      |
      -setup_timer(&_E->_field1._timer, &_callback, _E);
      +timer_setup(&_E->_field1._timer, _callback, 0);
      |
      -setup_timer(&_E->_field1._timer, _callback, (_cast_data)_E);
      +timer_setup(&_E->_field1._timer, _callback, 0);
      |
      -setup_timer(&_E->_field1._timer, &_callback, (_cast_data)_E);
      +timer_setup(&_E->_field1._timer, _callback, 0);
      |
      -setup_timer(&_E->_field1._timer, (_cast_func)_callback, _E);
      +timer_setup(&_E->_field1._timer, _callback, 0);
      |
      -setup_timer(&_E->_field1._timer, (_cast_func)&_callback, _E);
      +timer_setup(&_E->_field1._timer, _callback, 0);
      |
      -setup_timer(&_E->_field1._timer, (_cast_func)_callback, (_cast_data)_E);
      +timer_setup(&_E->_field1._timer, _callback, 0);
      |
      -setup_timer(&_E->_field1._timer, (_cast_func)&_callback, (_cast_data)_E);
      +timer_setup(&_E->_field1._timer, _callback, 0);
      |
      -setup_timer(&_E._field1._timer, _callback, (_cast_data)_E);
      +timer_setup(&_E._field1._timer, _callback, 0);
      |
      -setup_timer(&_E._field1._timer, _callback, (_cast_data)&_E);
      +timer_setup(&_E._field1._timer, _callback, 0);
      |
      -setup_timer(&_E._field1._timer, &_callback, (_cast_data)_E);
      +timer_setup(&_E._field1._timer, _callback, 0);
      |
      -setup_timer(&_E._field1._timer, &_callback, (_cast_data)&_E);
      +timer_setup(&_E._field1._timer, _callback, 0);
      |
      -setup_timer(&_E._field1._timer, (_cast_func)_callback, (_cast_data)_E);
      +timer_setup(&_E._field1._timer, _callback, 0);
      |
      -setup_timer(&_E._field1._timer, (_cast_func)_callback, (_cast_data)&_E);
      +timer_setup(&_E._field1._timer, _callback, 0);
      |
      -setup_timer(&_E._field1._timer, (_cast_func)&_callback, (_cast_data)_E);
      +timer_setup(&_E._field1._timer, _callback, 0);
      |
      -setup_timer(&_E._field1._timer, (_cast_func)&_callback, (_cast_data)&_E);
      +timer_setup(&_E._field1._timer, _callback, 0);
      |
       _E->_field1._timer@_stl.function = _callback;
      |
       _E->_field1._timer@_stl.function = &_callback;
      |
       _E->_field1._timer@_stl.function = (_cast_func)_callback;
      |
       _E->_field1._timer@_stl.function = (_cast_func)&_callback;
      |
       _E._field1._timer@_stl.function = _callback;
      |
       _E._field1._timer@_stl.function = &_callback;
      |
       _E._field1._timer@_stl.function = (_cast_func)_callback;
      |
       _E._field1._timer@_stl.function = (_cast_func)&_callback;
      )
      
      // callback(unsigned long arg)
      @change_callback_handle_cast
       depends on change_timer_function_usage@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._field1;
      identifier change_timer_function_usage._timer;
      type _origtype;
      identifier _origarg;
      type _handletype;
      identifier _handle;
      @@
      
       void _callback(
      -_origtype _origarg
      +struct timer_list *t
       )
       {
      (
      	... when != _origarg
      	_handletype *_handle =
      -(_handletype *)_origarg;
      +from_timer(_handle, t, _field1._timer);
      	... when != _origarg
      |
      	... when != _origarg
      	_handletype *_handle =
      -(void *)_origarg;
      +from_timer(_handle, t, _field1._timer);
      	... when != _origarg
      |
      	... when != _origarg
      	_handletype *_handle;
      	... when != _handle
      	_handle =
      -(_handletype *)_origarg;
      +from_timer(_handle, t, _field1._timer);
      	... when != _origarg
      |
      	... when != _origarg
      	_handletype *_handle;
      	... when != _handle
      	_handle =
      -(void *)_origarg;
      +from_timer(_handle, t, _field1._timer);
      	... when != _origarg
      )
       }
      
      // callback(unsigned long arg) without existing variable
      @change_callback_handle_cast_no_arg
       depends on change_timer_function_usage &&
                           !change_callback_handle_cast@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._field1;
      identifier change_timer_function_usage._timer;
      type _origtype;
      identifier _origarg;
      type _handletype;
      @@
      
       void _callback(
      -_origtype _origarg
      +struct timer_list *t
       )
       {
      +	_handletype *_origarg = from_timer(_origarg, t, _field1._timer);
      +
      	... when != _origarg
      -	(_handletype *)_origarg
      +	_origarg
      	... when != _origarg
       }
      
      // Avoid already converted callbacks.
      @match_callback_converted
       depends on change_timer_function_usage &&
                  !change_callback_handle_cast &&
      	    !change_callback_handle_cast_no_arg@
      identifier change_timer_function_usage._callback;
      identifier t;
      @@
      
       void _callback(struct timer_list *t)
       { ... }
      
      // callback(struct something *handle)
      @change_callback_handle_arg
       depends on change_timer_function_usage &&
      	    !match_callback_converted &&
                  !change_callback_handle_cast &&
                  !change_callback_handle_cast_no_arg@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._field1;
      identifier change_timer_function_usage._timer;
      type _handletype;
      identifier _handle;
      @@
      
       void _callback(
      -_handletype *_handle
      +struct timer_list *t
       )
       {
      +	_handletype *_handle = from_timer(_handle, t, _field1._timer);
      	...
       }
      
      // If change_callback_handle_arg ran on an empty function, remove
      // the added handler.
      @unchange_callback_handle_arg
       depends on change_timer_function_usage &&
      	    change_callback_handle_arg@
      identifier change_timer_function_usage._callback;
      identifier change_timer_function_usage._field1;
      identifier change_timer_function_usage._timer;
      type _handletype;
      identifier _handle;
      identifier t;
      @@
      
       void _callback(struct timer_list *t)
       {
      -	_handletype *_handle = from_timer(_handle, t, _field1._timer);
       }
      
      // We only want to refactor the setup_timer() data argument if we've found
      // the matching callback. This undoes changes in change_timer_function_usage.
      @unchange_timer_function_usage
       depends on change_timer_function_usage &&
                  !change_callback_handle_cast &&
                  !change_callback_handle_cast_no_arg &&
      	    !change_callback_handle_arg@
      expression change_timer_function_usage._E;
      identifier change_timer_function_usage._field1;
      identifier change_timer_function_usage._timer;
      identifier change_timer_function_usage._callback;
      type change_timer_function_usage._cast_data;
      @@
      
      (
      -timer_setup(&_E->_field1._timer, _callback, 0);
      +setup_timer(&_E->_field1._timer, _callback, (_cast_data)_E);
      |
      -timer_setup(&_E._field1._timer, _callback, 0);
      +setup_timer(&_E._field1._timer, _callback, (_cast_data)&_E);
      )
      
      // If we fixed a callback from a .function assignment, fix the
      // assignment cast now.
      @change_timer_function_assignment
       depends on change_timer_function_usage &&
                  (change_callback_handle_cast ||
                   change_callback_handle_cast_no_arg ||
                   change_callback_handle_arg)@
      expression change_timer_function_usage._E;
      identifier change_timer_function_usage._field1;
      identifier change_timer_function_usage._timer;
      identifier change_timer_function_usage._callback;
      type _cast_func;
      typedef TIMER_FUNC_TYPE;
      @@
      
      (
       _E->_field1._timer.function =
      -_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E->_field1._timer.function =
      -&_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E->_field1._timer.function =
      -(_cast_func)_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E->_field1._timer.function =
      -(_cast_func)&_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._field1._timer.function =
      -_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._field1._timer.function =
      -&_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._field1._timer.function =
      -(_cast_func)_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      |
       _E._field1._timer.function =
      -(_cast_func)&_callback
      +(TIMER_FUNC_TYPE)_callback
       ;
      )
      
      // Sometimes timer functions are called directly. Replace matched args.
      @change_timer_function_calls
       depends on change_timer_function_usage &&
                  (change_callback_handle_cast ||
                   change_callback_handle_cast_no_arg ||
                   change_callback_handle_arg)@
      expression _E;
      identifier change_timer_function_usage._field1;
      identifier change_timer_function_usage._timer;
      identifier change_timer_function_usage._callback;
      type _cast_data;
      @@
      
       _callback(
      (
      -(_cast_data)_E
      +&_E->_field1._timer
      |
      -(_cast_data)&_E
      +&_E._field1._timer
      |
      -_E
      +&_E->_field1._timer
      )
       )
      
      // If a timer has been configured without a data argument, it can be
      // converted without regard to the callback argument, since it is unused.
      @match_timer_function_unused_data@
      expression _E;
      identifier _field1;
      identifier _timer;
      identifier _callback;
      @@
      
      (
      -setup_timer(&_E->_field1._timer, _callback, 0);
      +timer_setup(&_E->_field1._timer, _callback, 0);
      |
      -setup_timer(&_E->_field1._timer, _callback, 0L);
      +timer_setup(&_E->_field1._timer, _callback, 0);
      |
      -setup_timer(&_E->_field1._timer, _callback, 0UL);
      +timer_setup(&_E->_field1._timer, _callback, 0);
      |
      -setup_timer(&_E._field1._timer, _callback, 0);
      +timer_setup(&_E._field1._timer, _callback, 0);
      |
      -setup_timer(&_E._field1._timer, _callback, 0L);
      +timer_setup(&_E._field1._timer, _callback, 0);
      |
      -setup_timer(&_E._field1._timer, _callback, 0UL);
      +timer_setup(&_E._field1._timer, _callback, 0);
      |
      -setup_timer(&_field1._timer, _callback, 0);
      +timer_setup(&_field1._timer, _callback, 0);
      |
      -setup_timer(&_field1._timer, _callback, 0L);
      +timer_setup(&_field1._timer, _callback, 0);
      |
      -setup_timer(&_field1._timer, _callback, 0UL);
      +timer_setup(&_field1._timer, _callback, 0);
      |
      -setup_timer(_field1._timer, _callback, 0);
      +timer_setup(_field1._timer, _callback, 0);
      |
      -setup_timer(_field1._timer, _callback, 0L);
      +timer_setup(_field1._timer, _callback, 0);
      |
      -setup_timer(_field1._timer, _callback, 0UL);
      +timer_setup(_field1._timer, _callback, 0);
      )
      
      @change_callback_unused_data
       depends on match_timer_function_unused_data@
      identifier match_timer_function_unused_data._callback;
      type _origtype;
      identifier _origarg;
      @@
      
       void _callback(
      -_origtype _origarg
      +struct timer_list *unused
       )
       {
      	... when != _origarg
       }
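      
      For readers unfamiliar with the conversion, here is a hand-written
      before/after of what the script does for a nested (2-field) timer;
      the struct and field names are made up:
      
      /* Before: the callback receives an opaque unsigned long. */
      struct inner { struct timer_list timer; };
      struct outer { struct inner sub; };
      
      static void cb(unsigned long data)
      {
              struct outer *o = (struct outer *)data;
              /* ... use o ... */
      }
      /* setup_timer(&o->sub.timer, cb, (unsigned long)o); */
      
      /* After: the callback receives the timer_list pointer and recovers
       * the container via from_timer() through both nested fields. */
      static void cb(struct timer_list *t)
      {
              struct outer *o = from_timer(o, t, sub.timer);
              /* ... use o ... */
      }
      /* timer_setup(&o->sub.timer, cb, 0); */
      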
      Signed-off-by: Kees Cook <keescook@chromium.org>
      86cb30ec
  5. 09 Nov 2017 (1 commit)
  6. 08 Nov 2017 (1 commit)
    • KVM: PPC: Book3S HV: Fix exclusion between HPT resizing and other HPT updates · 38c53af8
      Committed by Paul Mackerras
      Commit 5e985969 ("KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing
      implementation", 2016-12-20) added code that tries to exclude any use
      or update of the hashed page table (HPT) while the HPT resizing code
      is iterating through all the entries in the HPT.  It does this by
      taking the kvm->lock mutex, clearing the kvm->arch.hpte_setup_done
      flag and then sending an IPI to all CPUs in the host.  The idea is
      that any VCPU task that tries to enter the guest will see that the
      hpte_setup_done flag is clear and therefore call kvmppc_hv_setup_htab_rma,
      which also takes the kvm->lock mutex and will therefore block until
      we release kvm->lock.
      
      However, any VCPU that is already in the guest, or is handling a
      hypervisor page fault or hypercall, can re-enter the guest without
      rechecking the hpte_setup_done flag.  The IPI will cause a guest exit
      of any VCPUs that are currently in the guest, but does not prevent
      those VCPU tasks from immediately re-entering the guest.
      
      The result is that after resize_hpt_rehash_hpte() has made a HPTE
      absent, a hypervisor page fault can occur and make that HPTE present
      again.  This includes updating the rmap array for the guest real page,
      meaning that we now have a pointer in the rmap array which connects
      with pointers in the old rev array but not the new rev array.  In
      fact, if the HPT is being reduced in size, the pointer in the rmap
      array could point outside the bounds of the new rev array.  If that
      happens, we can get a host crash later on such as this one:
      
      [91652.628516] Unable to handle kernel paging request for data at address 0xd0000000157fb10c
      [91652.628668] Faulting instruction address: 0xc0000000000e2640
      [91652.628736] Oops: Kernel access of bad area, sig: 11 [#1]
      [91652.628789] LE SMP NR_CPUS=1024 NUMA PowerNV
      [91652.628847] Modules linked in: binfmt_misc vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables ses enclosure scsi_transport_sas i2c_opal ipmi_powernv ipmi_devintf i2c_core ipmi_msghandler powernv_op_panel nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc kvm_hv kvm_pr kvm scsi_dh_alua dm_service_time dm_multipath tg3 ptp pps_core [last unloaded: stap_552b612747aec2da355051e464fa72a1_14259]
      [91652.629566] CPU: 136 PID: 41315 Comm: CPU 21/KVM Tainted: G           O    4.14.0-1.rc4.dev.gitb27fc5c.el7.centos.ppc64le #1
      [91652.629684] task: c0000007a419e400 task.stack: c0000000028d8000
      [91652.629750] NIP:  c0000000000e2640 LR: d00000000c36e498 CTR: c0000000000e25f0
      [91652.629829] REGS: c0000000028db5d0 TRAP: 0300   Tainted: G           O     (4.14.0-1.rc4.dev.gitb27fc5c.el7.centos.ppc64le)
      [91652.629932] MSR:  900000010280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 44022422  XER: 00000000
      [91652.630034] CFAR: d00000000c373f84 DAR: d0000000157fb10c DSISR: 40000000 SOFTE: 1
      [91652.630034] GPR00: d00000000c36e498 c0000000028db850 c000000001403900 c0000007b7960000
      [91652.630034] GPR04: d0000000117fb100 d000000007ab00d8 000000000033bb10 0000000000000000
      [91652.630034] GPR08: fffffffffffffe7f 801001810073bb10 d00000000e440000 d00000000c373f70
      [91652.630034] GPR12: c0000000000e25f0 c00000000fdb9400 f000000003b24680 0000000000000000
      [91652.630034] GPR16: 00000000000004fb 00007ff7081a0000 00000000000ec91a 000000000033bb10
      [91652.630034] GPR20: 0000000000010000 00000000001b1190 0000000000000001 0000000000010000
      [91652.630034] GPR24: c0000007b7ab8038 d0000000117fb100 0000000ec91a1190 c000001e6a000000
      [91652.630034] GPR28: 00000000033bb100 000000000073bb10 c0000007b7960000 d0000000157fb100
      [91652.630735] NIP [c0000000000e2640] kvmppc_add_revmap_chain+0x50/0x120
      [91652.630806] LR [d00000000c36e498] kvmppc_book3s_hv_page_fault+0xbb8/0xc40 [kvm_hv]
      [91652.630884] Call Trace:
      [91652.630913] [c0000000028db850] [c0000000028db8b0] 0xc0000000028db8b0 (unreliable)
      [91652.630996] [c0000000028db8b0] [d00000000c36e498] kvmppc_book3s_hv_page_fault+0xbb8/0xc40 [kvm_hv]
      [91652.631091] [c0000000028db9e0] [d00000000c36a078] kvmppc_vcpu_run_hv+0xdf8/0x1300 [kvm_hv]
      [91652.631179] [c0000000028dbb30] [d00000000c2248c4] kvmppc_vcpu_run+0x34/0x50 [kvm]
      [91652.631266] [c0000000028dbb50] [d00000000c220d54] kvm_arch_vcpu_ioctl_run+0x114/0x2a0 [kvm]
      [91652.631351] [c0000000028dbbd0] [d00000000c2139d8] kvm_vcpu_ioctl+0x598/0x7a0 [kvm]
      [91652.631433] [c0000000028dbd40] [c0000000003832e0] do_vfs_ioctl+0xd0/0x8c0
      [91652.631501] [c0000000028dbde0] [c000000000383ba4] SyS_ioctl+0xd4/0x130
      [91652.631569] [c0000000028dbe30] [c00000000000b8e0] system_call+0x58/0x6c
      [91652.631635] Instruction dump:
      [91652.631676] fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ffa1 2fa70000 793d0020 e9432110
      [91652.631814] 7bbf26e4 7c7e1b78 7feafa14 409e0094 <807f000c> 786326e4 7c6a1a14 93a40008
      [91652.631959] ---[ end trace ac85ba6db72e5b2e ]---
      
      To fix this, we tighten up the way that the hpte_setup_done flag is
      checked to ensure that it does provide the guarantee that the resizing
      code needs.  In kvmppc_run_core(), we check the hpte_setup_done flag
      after disabling interrupts and refuse to enter the guest if it is
      clear (for a HPT guest).  The code that checks hpte_setup_done and
      calls kvmppc_hv_setup_htab_rma() is moved from kvmppc_vcpu_run_hv()
      to a point inside the main loop in kvmppc_run_vcpu(), ensuring that
      we don't just spin endlessly calling kvmppc_run_core() while
      hpte_setup_done is clear, but instead have a chance to block on the
      kvm->lock mutex.
      
      Finally we also check hpte_setup_done inside the region in
      kvmppc_book3s_hv_page_fault() where the HPTE is locked and we are about
      to update the HPTE, and bail out if it is clear.  If another CPU is
      inside kvm_vm_ioctl_resize_hpt_commit() and has cleared hpte_setup_done,
      then we know that either we are looking at a HPTE
      that resize_hpt_rehash_hpte() has not yet processed, which is OK,
      or else we will see hpte_setup_done clear and refuse to update it,
      because of the full barrier formed by the unlock of the HPTE in
      resize_hpt_rehash_hpte() combined with the locking of the HPTE
      in kvmppc_book3s_hv_page_fault().
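      
      A rough sketch of the check added on the entry path (simplified;
      the real code lives in kvmppc_run_core() and kvmppc_run_vcpu()):
      
      	/* With interrupts off, refuse to enter the guest if the
      	 * resizer has cleared the HPT setup flag. */
      	local_irq_disable();
      	if (!kvm_is_radix(vc->kvm) && !vc->kvm->arch.hpte_setup_done) {
      		local_irq_enable();
      		/* go back and block on kvm->lock, then redo setup
      		 * via kvmppc_hv_setup_htab_rma() */
      		return;
      	}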
      
      Fixes: 5e985969 ("KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementation")
      Cc: stable@vger.kernel.org # v4.10+
      Reported-by: Satheesh Rajendran <satheera@in.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      38c53af8
  7. 06 Nov 2017 (1 commit)
    • KVM: PPC: Book3S HV: Handle host system reset in guest mode · 6de6638b
      Committed by Nicholas Piggin
      If the host takes a system reset interrupt while a guest is running,
      the CPU must exit the guest before processing the host exception
      handler.
      
      After this patch, taking a sysrq+x with a CPU running in a guest
      gives a trace like this:
      
         cpu 0x27: Vector: 100 (System Reset) at [c000000fdf5776f0]
             pc: c008000010158b80: kvmppc_run_core+0x16b8/0x1ad0 [kvm_hv]
             lr: c008000010158b80: kvmppc_run_core+0x16b8/0x1ad0 [kvm_hv]
             sp: c000000fdf577850
            msr: 9000000002803033
           current = 0xc000000fdf4b1e00
           paca    = 0xc00000000fd4d680	 softe: 3	 irq_happened: 0x01
             pid   = 6608, comm = qemu-system-ppc
         Linux version 4.14.0-rc7-01489-g47e1893a404a-dirty #26 SMP
         [c000000fdf577a00] c008000010159dd4 kvmppc_vcpu_run_hv+0x3dc/0x12d0 [kvm_hv]
         [c000000fdf577b30] c0080000100a537c kvmppc_vcpu_run+0x44/0x60 [kvm]
         [c000000fdf577b60] c0080000100a1ae0 kvm_arch_vcpu_ioctl_run+0x118/0x310 [kvm]
         [c000000fdf577c00] c008000010093e98 kvm_vcpu_ioctl+0x530/0x7c0 [kvm]
         [c000000fdf577d50] c000000000357bf8 do_vfs_ioctl+0xd8/0x8c0
         [c000000fdf577df0] c000000000358448 SyS_ioctl+0x68/0x100
         [c000000fdf577e30] c00000000000b220 system_call+0x58/0x6c
         --- Exception: c01 (System Call) at 00007fff76868df0
         SP (7fff7069baf0) is in userspace
      
      Fixes: e36d0a2e ("powerpc/powernv: Implement NMI IPI with OPAL_SIGNAL_SYSTEM_RESET")
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      6de6638b
  8. 02 Nov 2017 (1 commit)
    • License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Committed by Greg Kroah-Hartman
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it,
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information.
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier should be applied
      to a file was done in a spreadsheet of side-by-side results of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few thousand files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging were:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In the initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  9. 01 Nov 2017 (9 commits)
    • KVM: PPC: Book3S HV: Run HPT guests on POWER9 radix hosts · c0101509
      Committed by Paul Mackerras
      This patch removes the restriction that a radix host can only run
      radix guests, allowing us to run HPT (hashed page table) guests as
      well.  This is useful because it provides a way to run old guest
      kernels that know about POWER8 but not POWER9.
      
      Unfortunately, POWER9 currently has a restriction that all threads
      in a given core must either all be in HPT mode, or all in radix mode.
      This means that when entering a HPT guest, we have to obtain control
      of all 4 threads in the core and get them to switch their LPIDR and
      LPCR registers, even if they are not going to run a guest.  On guest
      exit we also have to get all threads to switch LPIDR and LPCR back
      to host values.
      
      To make this feasible, we require that KVM not be in the "independent
      threads" mode, and that the CPU cores be in single-threaded mode from
      the host kernel's perspective (only thread 0 online; threads 1, 2 and
      3 offline).  That allows us to use the same code as on POWER8 for
      obtaining control of the secondary threads.
      
      To manage the LPCR/LPIDR changes required, we extend the kvm_split_mode
      struct to contain the information needed by the secondary threads.
      All threads perform a barrier synchronization (where all threads wait
      for every other thread to reach the synchronization point) on guest
      entry, both before and after loading LPCR and LPIDR.  On guest exit,
      they all once again perform a barrier synchronization both before
      and after loading host values into LPCR and LPIDR.
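      
      The barrier itself can be pictured as below; this is a minimal
      sketch with hypothetical field names, not the actual code:
      
      	/* Each thread announces its arrival, then spins until all
      	 * four threads of the core have arrived, before (and again
      	 * after) touching LPIDR/LPCR. */
      	atomic_inc(&sip->n_arrived);
      	while (atomic_read(&sip->n_arrived) < MAX_SMT_THREADS)
      		cpu_relax();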
      
      Finally, it is also currently necessary to flush the entire TLB every
      time we enter a HPT guest on a radix host.  We do this on thread 0
      with a loop of tlbiel instructions.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      c0101509
    • KVM: PPC: Book3S HV: Allow for running POWER9 host in single-threaded mode · 516f7898
      Committed by Paul Mackerras
      This patch allows for a mode on POWER9 hosts where we control all the
      threads of a core, much as we do on POWER8.  The mode is controlled by
      a module parameter on the kvm_hv module, called "indep_threads_mode".
      The normal mode on POWER9 is the "independent threads" mode, with
      indep_threads_mode=Y, where the host is in SMT4 mode (or in fact any
      desired SMT mode) and each thread independently enters and exits from
      KVM guests without reference to what other threads in the core are
      doing.
      
      If indep_threads_mode is set to N at the point when a VM is started,
      KVM will expect every core that the guest runs on to be in single
      threaded mode (that is, threads 1, 2 and 3 offline), and will set the
      flag that prevents secondary threads from coming online.  We can still
      use all four threads; the code that implements dynamic micro-threading
      on POWER8 will become active in over-commit situations and will allow
      up to three other VCPUs to be run on the secondary threads of the core
      whenever a VCPU is run.
      
      The reason for wanting this mode is that this will allow us to run HPT
      guests on a radix host on a POWER9 machine that does not support
      "mixed mode", that is, having some threads in a core be in HPT mode
      while other threads are in radix mode.  It will also make it possible
      to implement a "strict threads" mode in future, if desired.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      516f7898
    • KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host · 18c3640c
      Committed by Paul Mackerras
      This sets up the machinery for switching a guest between HPT (hashed
      page table) and radix MMU modes, so that in future we can run a HPT
      guest on a radix host on POWER9 machines.
      
      * The KVM_PPC_CONFIGURE_V3_MMU ioctl can now specify either HPT or
        radix mode, on a radix host.
      
      * The KVM_CAP_PPC_MMU_HASH_V3 capability now returns 1 on POWER9
        with HV KVM on a radix host.
      
      * The KVM_PPC_GET_SMMU_INFO returns information about the HPT MMU on a
        radix host.
      
      * The KVM_PPC_ALLOCATE_HTAB ioctl on a radix host will switch the
        guest to HPT mode and allocate a HPT.
      
      * For simplicity, we now allocate the rmap array for each memslot,
        even on a radix host, since it will be needed if the guest switches
        to HPT mode.
      
      * Since we cannot yet run a HPT guest on a radix host, the KVM_RUN
        ioctl will return an EINVAL error in that case.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      18c3640c
    • KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix · e641a317
      Committed by Paul Mackerras
      Currently, the HPT code in HV KVM maintains a dirty bit per guest page
      in the rmap array, whether or not dirty page tracking has been enabled
      for the memory slot.  In contrast, the radix code maintains a dirty
      bit per guest page in memslot->dirty_bitmap, and only does so when
      dirty page tracking has been enabled.
      
      This changes the HPT code to maintain the dirty bits in the memslot
      dirty_bitmap like radix does.  This results in slightly less code
      overall, and will mean that we do not lose the dirty bits when
      transitioning between HPT and radix mode in future.
      
      There is one minor change to behaviour as a result.  With HPT, when
      dirty tracking was enabled for a memslot, we would previously clear
      all the dirty bits at that point (both in the HPT entries and in the
      rmap arrays), meaning that a KVM_GET_DIRTY_LOG ioctl immediately
      following would show no pages as dirty (assuming no vcpus have run
      in the meantime).  With this change, the dirty bits on HPT entries
      are not cleared at the point where dirty tracking is enabled, so
      KVM_GET_DIRTY_LOG would show as dirty any guest pages that are
      resident in the HPT and dirty.  This is consistent with what happens
      on radix.
      
      This also fixes a bug in the mark_pages_dirty() function for radix
      (in the sense that the function no longer exists).  In the case where
      a large page of 64 normal pages or more is marked dirty, the
      addressing of the dirty bitmap was incorrect and could write past
      the end of the bitmap.  Fortunately this case was never hit in
      practice because a 2MB large page is only 32 x 64kB pages, and we
      don't support backing the guest with 1GB huge pages at this point.
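      
      A sketch of the bitmap-marking helper this implies (close to, but not
      necessarily identical to, what the patch adds; i and npages are
      assumed byte-aligned on the memset path, which holds for the
      large-page cases):
      
      	static inline void set_dirty_bits(unsigned long *map,
      					  unsigned long i,
      					  unsigned long npages)
      	{
      		if (npages >= 8)
      			memset((char *)map + i / 8, 0xff, npages / 8);
      		else
      			for (; npages; ++i, --npages)
      				__set_bit_le(i, map);
      	}
      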
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      e641a317
    • KVM: PPC: Book3S HV: Rename hpte_setup_done to mmu_ready · 1b151ce4
      Committed by Paul Mackerras
      This renames the kvm->arch.hpte_setup_done field to mmu_ready because
      we will want to use it for radix guests too -- both for setting things
      up before vcpu execution, and for excluding vcpus from executing while
      MMU-related things get changed, such as in future switching the MMU
      from radix to HPT mode or vice-versa.
      
      This also moves the call to kvmppc_setup_partition_table() that was
      done in kvmppc_hv_setup_htab_rma() for HPT guests, and the setting
      of mmu_ready, into the caller in kvmppc_vcpu_run_hv().
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      1b151ce4
    • KVM: PPC: Book3S HV: Don't rely on host's page size information · 8dc6cca5
      Committed by Paul Mackerras
      This removes the dependence of KVM on the mmu_psize_defs array (which
      stores information about hardware support for various page sizes) and
      the things derived from it, chiefly hpte_page_sizes[], hpte_page_size(),
      hpte_actual_page_size() and get_sllp_encoding().  We also no longer
      rely on the mmu_slb_size variable or the MMU_FTR_1T_SEGMENTS feature
      bit.
      
      The reason for doing this is so we can support a HPT guest on a radix
      host.  In a radix host, the mmu_psize_defs array contains information
      about page sizes supported by the MMU in radix mode rather than the
      page sizes supported by the MMU in HPT mode.  Similarly, mmu_slb_size
      and the MMU_FTR_1T_SEGMENTS bit are not set.
      
      Instead we hard-code knowledge of the behaviour of the HPT MMU in the
      POWER7, POWER8 and POWER9 processors (which are the only processors
      supported by HV KVM) - specifically the encoding of the LP fields in
      the HPT and SLB entries, and the fact that they have 32 SLB entries
      and support 1TB segments.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      8dc6cca5
    • KVM: PPC: Book3S: Fix gas warning due to using r0 as immediate 0 · 93897a1f
      Committed by Nicholas Piggin
      This fixes the message:
      
      arch/powerpc/kvm/book3s_segment.S: Assembler messages:
      arch/powerpc/kvm/book3s_segment.S:330: Warning: invalid register expression
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      93897a1f
    • KVM: PPC: Book3S PR: Only install valid SLBs during KVM_SET_SREGS · f4093ee9
      Committed by Greg Kurz
      Userland passes an array of 64 SLB descriptors to KVM_SET_SREGS,
      some of which are valid (ie, SLB_ESID_V is set) and the rest are
      likely all-zeroes (with QEMU at least).
      
      Each of them is then passed to kvmppc_mmu_book3s_64_slbmte(), which
      expects to find the SLB index in the 3 lower bits of its rb argument.
      When passed zeroed arguments, it happily overwrites the 0th SLB entry
      with zeroes. This is exactly what happens while doing live migration
      with QEMU when the destination pushes the incoming SLB descriptors to
      KVM PR. When reloading the SLBs at the next synchronization, QEMU first
      clears its SLB array and only restores valid ones, but the 0th one is
      now gone and we cannot access the corresponding memory anymore:
      
      (qemu) x/x $pc
      c0000000000b742c: Cannot access memory
      
      To avoid this, let's filter out non-valid SLB entries. While here, we
      also force a full SLB flush before installing new entries. Since SLB
      is for 64-bit only, we now build this path conditionally to avoid a
      build break on 32-bit, which doesn't define SLB_ESID_V.
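      
      A condensed sketch of the resulting loop (simplified from the patch):
      
      #ifdef CONFIG_PPC_BOOK3S_64
      	/* Start from a clean SLB, then install only valid entries. */
      	vcpu->arch.mmu.slbia(vcpu);
      	for (i = 0; i < ARRAY_SIZE(sregs->u.s.ppc64.slb); i++) {
      		u64 rb = sregs->u.s.ppc64.slb[i].slbe;
      		u64 rs = sregs->u.s.ppc64.slb[i].slbv;
      
      		if (rb & SLB_ESID_V)
      			vcpu->arch.mmu.slbmte(vcpu, rs, rb);
      	}
      #endif
      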
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      f4093ee9
    • KVM: PPC: Book3S HV: Don't call real-mode XICS hypercall handlers if not enabled · 00bb6ae5
      Committed by Paul Mackerras
      When running a guest on a POWER9 system with the in-kernel XICS
      emulation disabled (for example by running QEMU with the parameter
      "-machine pseries,kernel_irqchip=off"), the kernel does not pass
      the XICS-related hypercalls such as H_CPPR up to userspace for
      emulation there as it should.
      
      The reason for this is that the real-mode handlers for these
      hypercalls don't check whether a XICS device has been instantiated
      before calling the xics-on-xive code.  That code doesn't check
      either, leading to potential NULL pointer dereferences because
      vcpu->arch.xive_vcpu is NULL.  Those dereferences won't cause an
      exception in real mode but will lead to kernel memory corruption.
      
      This fixes it by adding kvmppc_xics_enabled() checks before calling
      the XICS functions.
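      
      The shape of the fix, roughly (a sketch; returning H_TOSS from a
      real-mode handler makes the hcall fall through to the virtual-mode
      path and from there to userspace):
      
      	unsigned long kvmppc_rm_h_xirr(struct kvm_vcpu *vcpu)
      	{
      		if (!kvmppc_xics_enabled(vcpu))
      			return H_TOSS;
      		/* ... existing xics-on-xive fast path ... */
      	}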
      
      Cc: stable@vger.kernel.org # v4.11+
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      00bb6ae5
  10. 20 Oct 2017 (1 commit)
    • KVM: PPC: Tie KVM_CAP_PPC_HTM to the user-visible TM feature · 2a3d6553
      Committed by Michael Ellerman
      Currently we use CPU_FTR_TM to decide if the CPU/kernel can support
      TM (Transactional Memory), and if it's true we advertise that to
      Qemu (or similar) via KVM_CAP_PPC_HTM.
      
      PPC_FEATURE2_HTM is the user-visible feature bit, which indicates that
      the CPU and kernel can support TM. Currently CPU_FTR_TM and
      PPC_FEATURE2_HTM always have the same value, either true or false, so
      using the former for KVM_CAP_PPC_HTM is correct.
      
      However some Power9 CPUs can operate in a mode where TM is enabled but
      TM suspended state is disabled. In this mode CPU_FTR_TM is true, but
      PPC_FEATURE2_HTM is false. Instead a different PPC_FEATURE2 bit is
      set, to indicate that this different mode of TM is available.
      
      It is not safe to let guests use TM as-is, when the CPU is in this
      mode. So to prevent that from happening, use PPC_FEATURE2_HTM to
      determine the value of KVM_CAP_PPC_HTM.
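      
      In kvm_vm_ioctl_check_extension(), the change amounts to something
      like this sketch:
      
      	case KVM_CAP_PPC_HTM:
      		/* was: r = hv_enabled && cpu_has_feature(CPU_FTR_TM); */
      		r = hv_enabled &&
      			(cur_cpu_spec->cpu_user_features2 & PPC_FEATURE2_HTM);
      		break;
      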
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      2a3d6553
  11. 19 Oct 2017 (1 commit)
  12. 16 Oct 2017 (2 commits)
    • KVM: PPC: Book3S HV: Add more barriers in XIVE load/unload code · ad98dd1a
      Committed by Benjamin Herrenschmidt
      On POWER9 systems, we push the VCPU context onto the XIVE (eXternal
      Interrupt Virtualization Engine) hardware when entering a guest,
      and pull the context off the XIVE when exiting the guest.  The push
      is done with cache-inhibited stores, and the pull with cache-inhibited
      loads.
      
      Testing has revealed that it is possible (though very rare) for
      the stores to get reordered with the loads so that we end up with the
      guest VCPU context still loaded on the XIVE after we have exited the
      guest.  When that happens, it is possible for the same VCPU context
      to then get loaded on another CPU, which causes the machine to
      checkstop.
      
      To fix this, we add I/O barrier instructions (eieio) before and
      after the push and pull operations.  As partial compensation for the
      potential slowdown caused by the extra barriers, we remove the eieio
      instructions between the two stores in the push operation, and between
      the two loads in the pull operation.  (The architecture requires
      loads to cache-inhibited, guarded storage to be kept in order, and
      requires stores to cache-inhibited, guarded storage likewise to be
      kept in order, but allows such loads and stores to be reordered with
      respect to each other.)
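      
      In C-like pseudocode, the new push sequence is roughly as below (the
      real change is in the assembly entry/exit paths; eieio() and
      ci_store() stand in for the barrier instruction and the
      cache-inhibited stores):
      
      	eieio();                  /* order earlier accesses vs CI stores */
      	ci_store(tima_os_w0, w0); /* push word 0 of the VCPU context */
      	ci_store(tima_os_w1, w1); /* no eieio needed between CI stores */
      	eieio();                  /* order the push vs later CI loads */
      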
      Reported-by: Carol L Soto <clsoto@us.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      ad98dd1a
    • KVM: PPC: Book3S HV: Explicitly disable HPT operations on radix guests · 891f1ebf
      Committed by Paul Mackerras
      This adds code to make sure that we don't try to access the
      non-existent HPT for a radix guest using the htab file for the VM
      in debugfs, a file descriptor obtained using the KVM_PPC_GET_HTAB_FD
      ioctl, or via the KVM_PPC_RESIZE_HPT_{PREPARE,COMMIT} ioctls.
      
      At present nothing bad happens if userspace does access these
      interfaces on a radix guest, mostly because kvmppc_hpt_npte()
      gives 0 for a radix guest, which in turn is because 1 << -4
      comes out as 0 on POWER processors.  However, that relies on
      undefined behaviour, so it is better to be explicit about not
      accessing the HPT for a radix guest.
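      
      The explicit guard at each of those entry points is essentially:
      
      	if (kvm_is_radix(kvm))
      		return -EINVAL;	/* no HPT to access for a radix guest */
      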
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      891f1ebf
  13. 14 Oct 2017 (8 commits)
    • KVM: PPC: Book3S PR: Enable in-kernel TCE handlers for PR KVM · 3f2bb764
      Committed by Alexey Kardashevskiy
      The handlers have supported PR KVM from day one; however, PR KVM's
      hcall enable/disable handler missed these ones.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      3f2bb764
    • KVM: PPC: Book3S HV: Delete an error message for a failed memory allocation in kvmppc_allocate_hpt() · 9c7e53dc
      Committed by Markus Elfring
      
      Omit an extra message for a memory allocation failure in this function.
      
      This issue was detected by using the Coccinelle software.
      Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      9c7e53dc
    • KVM: PPC: BookE: Use vma_pages function · 4bdcb701
      Committed by Thomas Meyer
      Use vma_pages function on vma object instead of explicit computation.
      Found by coccinelle spatch "api/vma_pages.cocci"
      Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      4bdcb701
    • KVM: PPC: Book3S HV: Use ARRAY_SIZE macro · 4bb817ed
      Committed by Thomas Meyer
      Use ARRAY_SIZE macro, rather than explicitly coding some variant of it
      yourself.
      Found with: find -type f -name "*.c" -o -name "*.h" | xargs perl -p -i -e
      's/\bsizeof\s*\(\s*(\w+)\s*\)\s*\/\s*sizeof\s*\(\s*\1\s*\[\s*0\s*\]\s*\)/ARRAY_SIZE(\1)/g'
      and manual check/verification.
      Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      4bb817ed
    • KVM: PPC: Book3S HV: Handle unexpected interrupts better · 857b99e1
      Committed by Paul Mackerras
      At present, if an interrupt (i.e. an exception or trap) occurs in the
      code where KVM is switching the MMU to or from guest context, we jump
      to kvmppc_bad_host_intr, where we simply spin with interrupts disabled.
      In this situation, it is hard to debug what happened because we get no
      indication as to which interrupt occurred or where.  Typically we get
      a cascade of stall and soft lockup warnings from other CPUs.
      
      In order to get more information for debugging, this adds code to
      create a stack frame on the emergency stack and save register values
      to it.  We start half-way down the emergency stack in order to give
      ourselves some chance of being able to do a stack trace on secondary
      threads that are already on the emergency stack.
      
      On POWER7 or POWER8, we then just spin, as before, because we don't
      know what state the MMU context is in or what other threads are doing,
      and we can't switch back to host context without coordinating with
      other threads.  On POWER9 we can do better; there we load up the host
      MMU context and jump to C code, which prints an oops message to the
      console and panics.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      857b99e1
    • KVM: PPC: Book3S: Protect kvmppc_gpa_to_ua() with SRCU · 8f6a9f0d
      Committed by Alexey Kardashevskiy
      kvmppc_gpa_to_ua() accesses the KVM memory slot array via
      srcu_dereference_check(), and this produces RCU warnings like the one
      below.
      
      This extends the existing srcu_read_lock/unlock pair to cover the
      kvmppc_gpa_to_ua() call as well.
      
      We did not hit this before because the lock is not needed for the
      realmode handlers, and hash guests use the realmode path all the time;
      radix guests, however, are always redirected to the virtual mode
      handlers, hence the warning.
      
      [   68.253798] ./include/linux/kvm_host.h:575 suspicious rcu_dereference_check() usage!
      [   68.253799]
                     other info that might help us debug this:
      
      [   68.253802]
                     rcu_scheduler_active = 2, debug_locks = 1
      [   68.253804] 1 lock held by qemu-system-ppc/6413:
      [   68.253806]  #0:  (&vcpu->mutex){+.+.}, at: [<c00800000e3c22f4>] vcpu_load+0x3c/0xc0 [kvm]
      [   68.253826]
                     stack backtrace:
      [   68.253830] CPU: 92 PID: 6413 Comm: qemu-system-ppc Tainted: G        W       4.14.0-rc3-00553-g432dcba58e9c-dirty #72
      [   68.253833] Call Trace:
      [   68.253839] [c000000fd3d9f790] [c000000000b7fcc8] dump_stack+0xe8/0x160 (unreliable)
      [   68.253845] [c000000fd3d9f7d0] [c0000000001924c0] lockdep_rcu_suspicious+0x110/0x180
      [   68.253851] [c000000fd3d9f850] [c0000000000e825c] kvmppc_gpa_to_ua+0x26c/0x2b0
      [   68.253858] [c000000fd3d9f8b0] [c00800000e3e1984] kvmppc_h_put_tce+0x12c/0x2a0 [kvm]
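      
      A sketch of the resulting locking on the virtual-mode H_PUT_TCE path
      (simplified from the patch):
      
      	idx = srcu_read_lock(&vcpu->kvm->srcu);
      	ret = kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL);
      	/* ... use ua while still under the srcu read lock ... */
      	srcu_read_unlock(&vcpu->kvm->srcu, idx);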
      
      Fixes: 121f80ba ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      8f6a9f0d
    • KVM: PPC: Book3S HV: POWER9 more doorbell fixes · 2cde3716
      Committed by Nicholas Piggin
      - Add another case where msgsync is required.
      - Required barrier sequence for global doorbells is msgsync ; lwsync
      
      When msgsnd is used for IPIs to other cores, msgsync must be executed by
      the target to order stores performed on the source before its msgsnd
      (provided the source executes the appropriate sync).
      
      Fixes: 1704a81c ("KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9")
      Cc: stable@vger.kernel.org # v4.10+
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      2cde3716
    • KVM: PPC: Fix oops when checking KVM_CAP_PPC_HTM · ac64115a
      Committed by Greg Kurz
      The following program causes a kernel oops:
      
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <sys/ioctl.h>
      #include <linux/kvm.h>
      
      int main(void)
      {
          int fd = open("/dev/kvm", O_RDWR);
          ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_HTM);
      }
      
      This happens because when using the global KVM fd with
      KVM_CHECK_EXTENSION, kvm_vm_ioctl_check_extension() gets
      called with a NULL kvm argument, which gets dereferenced
      in is_kvmppc_hv_enabled(). Spotted while reading the code.
      
      Let's use the hv_enabled fallback variable, like everywhere
      else in this function.
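      
      The fix is a one-liner along these lines (a sketch; at the time of
      this commit the HTM check still used CPU_FTR_TM):
      
      	case KVM_CAP_PPC_HTM:
      		/* kvm may be NULL when called on the global /dev/kvm fd,
      		 * so test the local hv_enabled flag, not kvm itself. */
      		r = hv_enabled && cpu_has_feature(CPU_FTR_TM);
      		break;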
      
      Fixes: 23528bb2 ("KVM: PPC: Introduce KVM_CAP_PPC_HTM")
      Cc: stable@vger.kernel.org # v4.7+
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Thomas Huth <thuth@redhat.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      ac64115a
  14. 03 10月, 2017 1 次提交
  15. 22 9月, 2017 1 次提交
    • M
      KVM: PPC: Book3S HV: Check for updated HDSISR on P9 HDSI exception · e001fa78
      Committed by Michael Neuling
      On POWER9 DD2.1 and below, the HDSISR is sometimes not updated at
      all on a Hypervisor Data Storage Interrupt (HDSI).
      
      To work around this we put a canary value into the HDSISR before
      returning to a guest and then check for this canary when we take
      an HDSI. If we find the canary on an HDSI, we know the hardware
      didn't update the HDSISR. In this case we return to the guest to
      retake the HDSI, which should correctly update the HDSISR on the
      second entry.
      
      After talking to Paulus we've applied this workaround to all
      POWER9 CPUs. The workaround of returning to the guest should
      never be triggered on a well-behaved CPU, and the extra
      instructions should have negligible performance impact.
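      
      A C-level sketch of the workaround (the real code lives in the
      guest entry/exit assembly; the canary value and helper below are
      illustrative):
      
          #define HDSISR_CANARY 0x7fff
      
          /* Before returning to the guest: plant the canary. */
          mtspr(SPRN_HDSISR, HDSISR_CANARY);
      
          /* On taking an HDSI: if the canary survived, the hardware
           * failed to update HDSISR; re-enter the guest so the HDSI
           * is retaken and HDSISR is updated the second time. */
          if (mfspr(SPRN_HDSISR) == HDSISR_CANARY)
              return_to_guest();  /* hypothetical helper */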
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e001fa78
  16. 15 Sep, 2017 1 commit
  17. 12 Sep, 2017 3 commits
    • P
      KVM: PPC: Book3S HV: Fix bug causing host SLB to be restored incorrectly · 67f8a8c1
      Committed by Paul Mackerras
      Aneesh Kumar reported seeing host crashes when running recent kernels
      on POWER8.  The symptom was an oops like this:
      
      Unable to handle kernel paging request for data at address 0xf00000000786c620
      Faulting instruction address: 0xc00000000030e1e4
      Oops: Kernel access of bad area, sig: 11 [#1]
      LE SMP NR_CPUS=2048 NUMA PowerNV
      Modules linked in: powernv_op_panel
      CPU: 24 PID: 6663 Comm: qemu-system-ppc Tainted: G        W 4.13.0-rc7-43932-gfc36c59 #2
      task: c000000fdeadfe80 task.stack: c000000fdeb68000
      NIP:  c00000000030e1e4 LR: c00000000030de6c CTR: c000000000103620
      REGS: c000000fdeb6b450 TRAP: 0300   Tainted: G        W        (4.13.0-rc7-43932-gfc36c59)
      MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24044428  XER: 20000000
      CFAR: c00000000030e134 DAR: f00000000786c620 DSISR: 40000000 SOFTE: 0
      GPR00: 0000000000000000 c000000fdeb6b6d0 c0000000010bd000 000000000000e1b0
      GPR04: c00000000115e168 c000001fffa6e4b0 c00000000115d000 c000001e1b180386
      GPR08: f000000000000000 c000000f9a8913e0 f00000000786c600 00007fff587d0000
      GPR12: c000000fdeb68000 c00000000fb0f000 0000000000000001 00007fff587cffff
      GPR16: 0000000000000000 c000000000000000 00000000003fffff c000000fdebfe1f8
      GPR20: 0000000000000004 c000000fdeb6b8a8 0000000000000001 0008000000000040
      GPR24: 07000000000000c0 00007fff587cffff c000000fdec20bf8 00007fff587d0000
      GPR28: c000000fdeca9ac0 00007fff587d0000 00007fff587c0000 00007fff587d0000
      NIP [c00000000030e1e4] __get_user_pages_fast+0x434/0x1070
      LR [c00000000030de6c] __get_user_pages_fast+0xbc/0x1070
      Call Trace:
      [c000000fdeb6b6d0] [c00000000139dab8] lock_classes+0x0/0x35fe50 (unreliable)
      [c000000fdeb6b7e0] [c00000000030ef38] get_user_pages_fast+0xf8/0x120
      [c000000fdeb6b830] [c000000000112318] kvmppc_book3s_hv_page_fault+0x308/0xf30
      [c000000fdeb6b960] [c00000000010e10c] kvmppc_vcpu_run_hv+0xfdc/0x1f00
      [c000000fdeb6bb20] [c0000000000e915c] kvmppc_vcpu_run+0x2c/0x40
      [c000000fdeb6bb40] [c0000000000e5650] kvm_arch_vcpu_ioctl_run+0x110/0x300
      [c000000fdeb6bbe0] [c0000000000d6468] kvm_vcpu_ioctl+0x528/0x900
      [c000000fdeb6bd40] [c0000000003bc04c] do_vfs_ioctl+0xcc/0x950
      [c000000fdeb6bde0] [c0000000003bc930] SyS_ioctl+0x60/0x100
      [c000000fdeb6be30] [c00000000000b96c] system_call+0x58/0x6c
      Instruction dump:
      7ca81a14 2fa50000 41de0010 7cc8182a 68c60002 78c6ffe2 0b060000 3cc2000a
      794a3664 390610d8 e9080000 7d485214 <e90a0020> 7d435378 790507e1 408202f0
      ---[ end trace fad4a342d0414aa2 ]---
      
      It turns out that what has happened is that the SLB entry for the
      vmemmap region hasn't been reloaded on exit from a guest, and it
      has the wrong page size.  Then, when the host next accesses the
      vmemmap region, it gets a page fault.
      
      Commit a25bd72b ("powerpc/mm/radix: Workaround prefetch issue with
      KVM", 2017-07-24) modified the guest exit code so that it now only
      clears out the SLB for hash guests.  The code tests the radix flag
      and puts the result in a non-volatile CR field, CR2, and later
      branches based on CR2.
      
      Unfortunately, the kvmppc_save_tm function, which gets called
      between those two points, modifies all the user-visible registers
      in the case where the guest was in transactional or suspended
      state, except for a few which it restores (namely r1, r2, r9 and
      r13).  Thus the hash/radix indication in CR2 gets corrupted.
      
      This fixes the problem by re-doing the comparison just before the
      result is needed.  For good measure, this also adds comments next to
      the call sites of kvmppc_save_tm and kvmppc_restore_tm pointing out
      that non-volatile register state will be lost.
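      
      A C-level sketch of the hazard and the fix (the real code is
      assembly keeping the flag in CR2; kvm_is_radix() is the real
      predicate, the SLB helper below is hypothetical):
      
          bool radix = kvm_is_radix(vcpu->kvm);  /* computed early... */
      
          kvmppc_save_tm(vcpu);  /* ...clobbers nearly all registers
                                  * if the guest was transactional */
      
          radix = kvm_is_radix(vcpu->kvm);  /* fix: redo the test */
          if (!radix)
              clear_guest_slb();  /* hash guest: clear guest SLB */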
      
      Cc: stable@vger.kernel.org # v4.13
      Fixes: a25bd72b ("powerpc/mm/radix: Workaround prefetch issue with KVM")
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      67f8a8c1
    • P
      KVM: PPC: Book3S HV: Hold kvm->lock around call to kvmppc_update_lpcr · cf5f6f31
      Committed by Paul Mackerras
      Commit 468808bd ("KVM: PPC: Book3S HV: Set process table for HPT
      guests on POWER9", 2017-01-30) added a call to kvmppc_update_lpcr()
      which doesn't hold the kvm->lock mutex around the call, as required.
      This adds the lock/unlock pair and, for good measure, includes
      the kvmppc_setup_partition_table() call in the locked region,
      since it alters global state of the VM.
      
      This error appears not to have any fatal consequences for the
      host; the consequences would be that the VCPUs could end up
      running with different LPCR values, or that an LPCR update made
      by userspace through the one_reg interface, or the one done by
      kvmhv_configure_mmu(), could get overwritten.
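      
      A sketch of the locked region the fix creates, following the
      commit text (the LPCR mask bits shown are illustrative):
      
          mutex_lock(&kvm->lock);
      
          kvmppc_setup_partition_table(kvm);
          kvmppc_update_lpcr(kvm, lpcr, LPCR_VPM1 | LPCR_UPRT | LPCR_GTSE);
      
          mutex_unlock(&kvm->lock);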
      
      Cc: stable@vger.kernel.org # v4.10+
      Fixes: 468808bd ("KVM: PPC: Book3S HV: Set process table for HPT guests on POWER9")
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      cf5f6f31
    • B
      KVM: PPC: Book3S HV: Don't access XIVE PIPR register using byte accesses · d222af07
      Committed by Benjamin Herrenschmidt
      The XIVE interrupt controller on POWER9 machines doesn't support byte
      accesses to any register in the thread management area other than the
      CPPR (current processor priority register).  In particular, when
      reading the PIPR (pending interrupt priority register), we need to
      do a 32-bit or 64-bit load.
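      
      A sketch only (the helper name and the PIPR extraction shift are
      hypothetical; TM_QW1_OS is the real offset of the OS ring in the
      thread management area):
      
          static u8 xive_read_pipr(void __iomem *tima)
          {
              /* Wide load of the quadword containing PIPR, then
               * extract the byte; the TIMA rejects byte loads here. */
              u64 qw = be64_to_cpu(__raw_readq(tima + TM_QW1_OS));
      
              return (qw >> TM_PIPR_SHIFT) & 0xff;  /* hypothetical shift */
          }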
      
      Cc: stable@vger.kernel.org # v4.13
      Fixes: 2c4fb78f ("KVM: PPC: Book3S HV: Workaround POWER9 DD1.0 bug causing IPB bit loss")
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      d222af07
  18. 01 Sep, 2017 1 commit
  19. 31 Aug, 2017 4 commits