1. 26 June 2020, 1 commit
  2. 25 June 2020, 3 commits
  3. 23 June 2020, 8 commits
    • KVM: VMX: Remove vcpu_vmx's defunct copy of host_pkru · e4553b49
      Committed by Sean Christopherson
      Remove vcpu_vmx.host_pkru, which got left behind when PKRU support was
      moved to common x86 code.
      
      No functional change intended.
      
      Fixes: 37486135 ("KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200617034123.25647-1-sean.j.christopherson@intel.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: allow TSC to differ by NTP correction bounds without TSC scaling · 26769f96
      Committed by Marcelo Tosatti
      The Linux TSC calibration procedure is subject to small variations
      (it's common to see a +-1 kHz difference between reboots on a given CPU,
      for example).
      
      So migrating a guest between two hosts with identical processors can
      fail if there is a small variation in the calibrated TSC between them.
      
      Without TSC scaling, the current kernel interface will either return an error
      (if user_tsc_khz <= tsc_khz) or enable TSC catchup mode.
      
      This change makes the following TSC tolerance check accept
      KVM_SET_TSC_KHZ within tsc_tolerance_ppm (which is 250ppm by default):
      
              /*
               * Compute the variation in TSC rate which is acceptable
               * within the range of tolerance and decide if the
               * rate being applied is within that bounds of the hardware
               * rate.  If so, no scaling or compensation need be done.
               */
              thresh_lo = adjust_tsc_khz(tsc_khz, -tsc_tolerance_ppm);
              thresh_hi = adjust_tsc_khz(tsc_khz, tsc_tolerance_ppm);
              if (user_tsc_khz < thresh_lo || user_tsc_khz > thresh_hi) {
                      pr_debug("kvm: requested TSC rate %u falls outside tolerance [%u,%u]\n", user_tsc_khz, thresh_lo, thresh_hi);
                      use_scaling = 1;
              }
      
      The NTP daemon in the guest can correct this difference (NTP can
      correct up to 500ppm).
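      
      For reference, a sketch of the scaling helper used above (shape as in
      arch/x86/kvm/x86.c; details may vary by kernel version):
      
              /* Scale a kHz rate by a signed parts-per-million offset. */
              static u32 adjust_tsc_khz(u32 khz, s32 ppm)
              {
                      u64 v = (u64)khz * (1000000 + ppm);
      
                      do_div(v, 1000000);
                      return v;
              }
      
      For example, with tsc_khz = 2500000 (2.5 GHz) and the default 250ppm
      tolerance, the accepted window is [2499375, 2500625] kHz.
      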
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Message-Id: <20200616114741.GA298183@fuller.cnet>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Fix MSR range of APIC registers in X2APIC mode · bf10bd0b
      Committed by Xiaoyao Li
      Only MSR address range 0x800 through 0x8ff is architecturally reserved
      and dedicated for accessing APIC registers in x2APIC mode.
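      
      A minimal sketch of the corrected range check (simplified; upstream
      expresses this via APIC_BASE_MSR, which is 0x800):
      
              /* x2APIC registers occupy MSRs 0x800 - 0x8ff only. */
              static bool is_x2apic_msr(u32 index)
              {
                      return index >= 0x800 && index <= 0x8ff;
              }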
      
      Fixes: 0105d1a5 ("KVM: x2apic interface to lapic")
      Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Message-Id: <20200616073307.16440-1-xiaoyao.li@intel.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Stop context switching MSR_IA32_UMWAIT_CONTROL · bf09fb6c
      Committed by Sean Christopherson
      Remove support for context switching between the guest's and host's
      desired UMWAIT_CONTROL.  Propagating the guest's value to hardware isn't
      required for correct functionality, e.g. KVM intercepts reads and writes
      to the MSR, and the latency effects of the settings controlled by the
      MSR are not architecturally visible.
      
      As a general rule, KVM should not allow the guest to control power
      management settings unless explicitly enabled by userspace, e.g. see
      KVM_CAP_X86_DISABLE_EXITS.  E.g. Intel's SDM explicitly states that C0.2
      can improve the performance of SMT siblings.  A devious guest could
      disable C0.2 so as to improve the performance of its workloads to the
      detriment of workloads running on the host or in other VMs.
      
      Wholesale removal of UMWAIT_CONTROL context switching also fixes a race
      condition where updates from the host may cause KVM to enter the guest
      with the incorrect value.  Because updates are propagated to all
      CPUs via IPI (SMP function callback), the value in hardware may be
      stale with respect to the cached value and KVM could enter the guest
      with the wrong value in hardware.  As above, the guest can't observe the
      bad value, but it's a weird and confusing wart in the implementation.
      
      Removal also fixes the unnecessary usage of VMX's atomic load/store MSR
      lists.  Using the lists is only necessary for MSRs that are required for
      correct functionality immediately upon VM-Enter/VM-Exit, e.g. EFER on
      old hardware, or for MSRs that need to-the-uop precision, e.g. perf
      related MSRs.  For UMWAIT_CONTROL, the effects are only visible in the
      kernel via TPAUSE/delay(), and KVM doesn't do any form of delay in
      vcpu_vmx_run().  Using the atomic lists is undesirable as they are more
      expensive than direct RDMSR/WRMSR.
      
      Furthermore, even if giving the guest control of the MSR is legitimate,
      e.g. in pass-through scenarios, it's not clear that the benefits would
      outweigh the overhead.  E.g. saving and restoring an MSR across a VMX
      roundtrip costs ~250 cycles, and if the guest diverged from the host
      that cost would be paid on every run of the guest.  In other words, if
      there is a legitimate use case then it should be enabled by a new
      per-VM capability.
      
      Note, KVM still needs to emulate MSR_IA32_UMWAIT_CONTROL so that it can
      correctly expose other WAITPKG features to the guest, e.g. TPAUSE,
      UMWAIT and UMONITOR.
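      
      For reference, a minimal sketch of the direct RDMSR/WRMSR switching
      the commit contrasts with the atomic lists (hypothetical helper, not
      the removed code itself):
      
              static u64 host_umwait_control;
      
              /* Switch the MSR with plain RDMSR/WRMSR around VM-Enter
               * instead of the VMX atomic load/store MSR lists. */
              static void sketch_load_guest_umwait(u64 guest_val)
              {
                      rdmsrl(MSR_IA32_UMWAIT_CONTROL, host_umwait_control);
                      if (guest_val != host_umwait_control)
                              wrmsrl(MSR_IA32_UMWAIT_CONTROL, guest_val);
              }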
      
      Fixes: 6e3ba4ab ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL")
      Cc: stable@vger.kernel.org
      Cc: Jingqi Liu <jingqi.liu@intel.com>
      Cc: Tao Xu <tao3.xu@intel.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200623005135.10414-1-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Plumb L2 GPA through to PML emulation · 2dbebf7a
      Committed by Sean Christopherson
      Explicitly pass the L2 GPA to kvm_arch_write_log_dirty(), which for all
      intents and purposes is vmx_write_pml_buffer(), instead of having the
      latter pull the GPA from vmcs.GUEST_PHYSICAL_ADDRESS.  If the dirty bit
      update is the result of KVM emulation (rare for L2), then the GPA in the
      VMCS may be stale and/or hold a completely unrelated GPA.
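      
      A sketch of the plumbed interface (simplified; parameter naming
      assumed):
      
              /* The emulated PML writer receives the L2 GPA explicitly
               * instead of reading vmcs.GUEST_PHYSICAL_ADDRESS, which may
               * be stale when the dirty-bit update comes from emulation. */
              int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu, gpa_t l2_gpa);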
      
      Fixes: c5f983f6 ("nVMX: Implement emulated Page Modification Logging")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200622215832.22090-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Avoid mixing gpa_t with gfn_t in walk_addr_generic() · 312d16c7
      Committed by Vitaly Kuznetsov
      translate_gpa() returns a GPA, so assigning it to 'real_gfn' seems
      obviously wrong. There is no real issue because both 'gpa_t' and
      'gfn_t' are u64, and we don't use the value in 'real_gfn' as a GFN; we do
      
       real_gfn = gpa_to_gfn(real_gfn);
      
      instead. 'If you see a "buffalo" sign on an elephant's cage, do not trust
      your eyes', but let's fix it for good.
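      
      For reference, the conversion helper (shape as in
      include/linux/kvm_host.h):
      
              /* A GFN is a GPA with the page-offset bits shifted away. */
              static inline gfn_t gpa_to_gfn(gpa_t gpa)
              {
                      return (gfn_t)(gpa >> PAGE_SHIFT);
              }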
      
      No functional change intended.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200622151435.752560-1-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: LAPIC: ensure APIC map is up to date on concurrent update requests · 44d52717
      Committed by Paolo Bonzini
      The following race can cause lost map update events:
      
               cpu1                            cpu2
      
                                      apic_map_dirty = true
        ------------------------------------------------------------
                                      kvm_recalculate_apic_map:
                                           pass check
                                               mutex_lock(&kvm->arch.apic_map_lock);
                                               if (!kvm->arch.apic_map_dirty)
                                           and in process of updating map
        -------------------------------------------------------------
          other calls to
             apic_map_dirty = true         might be too late for affected cpu
        -------------------------------------------------------------
                                           apic_map_dirty = false
        -------------------------------------------------------------
          kvm_recalculate_apic_map:
          bail out on
            if (!kvm->arch.apic_map_dirty)
      
      To fix it, record the beginning of an update of the APIC map in
      apic_map_dirty.  If another APIC map change switches apic_map_dirty
      back to DIRTY during the update, kvm_recalculate_apic_map should not
      make it CLEAN, and the other caller will go through the slow path.
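      
      A minimal sketch of the resulting tri-state scheme (surrounding code
      simplified; upstream keeps the state in kvm->arch.apic_map_dirty):
      
              enum { CLEAN, UPDATE_IN_PROGRESS, DIRTY };
      
              void sketch_recalculate(atomic_t *map_state)
              {
                      if (atomic_read(map_state) == CLEAN)
                              return;
                      /* Record that an update has begun. */
                      atomic_cmpxchg(map_state, DIRTY, UPDATE_IN_PROGRESS);
                      /* ... rebuild the map under apic_map_lock ... */
                      /* Mark CLEAN only if nobody re-dirtied the map in the
                       * meantime; otherwise the next caller takes the slow
                       * path again. */
                      atomic_cmpxchg(map_state, UPDATE_IN_PROGRESS, CLEAN);
              }
      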
      Reported-by: Igor Mammedov <imammedo@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: lapic: fix broken vcpu hotplug · af28dfac
      Committed by Igor Mammedov
      Guest fails to online hotplugged CPU with error
        smpboot: do_boot_cpu failed(-1) to wakeup CPU#4
      
      It's caused by the fact that kvm_apic_set_state(), which used to call
      recalculate_apic_map() unconditionally and pull the hotplugged CPU into
      the APIC map, now updates the map conditionally on state changes.  In
      this case the APIC map is not considered dirty and is not updated.
      
      Fix the issue by forcing unconditional update from kvm_apic_set_state(),
      like it used to be.
      
      Fixes: 4abaffce ("KVM: LAPIC: Recalculate apic map in batch")
      Cc: stable@vger.kernel.org
      Signed-off-by: Igor Mammedov <imammedo@redhat.com>
      Message-Id: <20200622160830.426022-1-imammedo@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 20 June 2020, 1 commit
    • x86/asm/64: Align start of __clear_user() loop to 16-bytes · bb5570ad
      Committed by Matt Fleming
      x86 CPUs can suffer severe performance drops if a tight loop, such as
      the ones in __clear_user(), straddles a 16-byte instruction fetch
      window, or worse, a 64-byte cacheline. This issue was discovered in the
      SUSE kernel with the following commit,
      
        11539337 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")
      
      which increased the code object size from 10 bytes to 15 bytes and
      caused the 8-byte copy loop in __clear_user() to be split across a
      64-byte cacheline.
      
      Aligning the start of the loop to 16 bytes makes this fit neatly inside
      a single instruction fetch window again and restores the performance of
      __clear_user() which is used heavily when reading from /dev/zero.
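      
      A minimal illustration of the idea (a sketch, not the upstream diff;
      ".p2align 4" pads the loop head to a 16-byte boundary):
      
              /* Zero 'qwords' 8-byte words; aligning the loop head keeps
               * the small loop inside one 16-byte fetch window. */
              static void zero_qwords(unsigned long *dst, unsigned long qwords)
              {
                      if (!qwords)
                              return;
                      asm volatile(
                              "       .p2align 4\n"
                              "0:     movq $0, (%0)\n"
                              "       addq $8, %0\n"
                              "       decq %1\n"
                              "       jnz 0b\n"
                              : "+r" (dst), "+r" (qwords)
                              :
                              : "memory");
              }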
      
      Here are some numbers from running libmicro's read_z* and pread_z*
      microbenchmarks which read from /dev/zero:
      
        Zen 1 (Naples)
      
        libmicro-file
                                              5.7.0-rc6              5.7.0-rc6              5.7.0-rc6
                                                          revert-11539337+               align16+
        Time mean95-pread_z100k       9.9195 (   0.00%)      5.9856 (  39.66%)      5.9938 (  39.58%)
        Time mean95-pread_z10k        1.1378 (   0.00%)      0.7450 (  34.52%)      0.7467 (  34.38%)
        Time mean95-pread_z1k         0.2623 (   0.00%)      0.2251 (  14.18%)      0.2252 (  14.15%)
        Time mean95-pread_zw100k      9.9974 (   0.00%)      6.0648 (  39.34%)      6.0756 (  39.23%)
        Time mean95-read_z100k        9.8940 (   0.00%)      5.9885 (  39.47%)      5.9994 (  39.36%)
        Time mean95-read_z10k         1.1394 (   0.00%)      0.7483 (  34.33%)      0.7482 (  34.33%)
      
      Note that this doesn't affect Haswell or Broadwell microarchitectures
      which seem to avoid the alignment issue by executing the loop straight
      out of the Loop Stream Detector (verified using perf events).
      
      Fixes: 11539337 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: <stable@vger.kernel.org> # v4.19+
      Link: https://lkml.kernel.org/r/20200618102002.30034-1-matt@codeblueprint.co.uk
  5. 19 June 2020, 4 commits
    • Revert "KVM: VMX: Micro-optimize vmexit time when not exposing PMU" · 49097762
      Committed by Vitaly Kuznetsov
      Guest crashes are observed on a Cascade Lake system when 'perf top' is
      launched on the host, e.g.
      
       BUG: unable to handle kernel paging request at fffffe0000073038
       PGD 7ffa7067 P4D 7ffa7067 PUD 7ffa6067 PMD 7ffa5067 PTE ffffffffff120
       Oops: 0000 [#1] SMP PTI
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.18.0+ #380
      ...
       Call Trace:
        serial8250_console_write+0xfe/0x1f0
        call_console_drivers.constprop.0+0x9d/0x120
        console_unlock+0x1ea/0x460
      
      Call traces vary, but a crash soon follows.  The problem was
      blindly bisected to the commit 041bc42c ("KVM: VMX: Micro-optimize
      vmexit time when not exposing PMU"). It was also confirmed that the
      issue goes away if PMU is exposed to the guest.
      
      With some instrumentation of the guest we can see what is being switched
      (when we do atomic_switch_perf_msrs()):
      
       vmx_vcpu_run: switching 2 msrs
       vmx_vcpu_run: switching MSR38f guest: 70000000d host: 70000000f
       vmx_vcpu_run: switching MSR3f1 guest: 0 host: 2
      
      The current guess is that PEBS (MSR_IA32_PEBS_ENABLE, 0x3f1) is to blame.
      Regardless of whether PMU is exposed to the guest or not, PEBS needs to
      be disabled upon switch.
      
      This reverts commit 041bc42c.
      Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200619094046.654019-1-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/platform/intel-mid: convert to use i2c_new_client_device() · f04a5ba1
      Committed by Wolfram Sang
      Move away from the deprecated API and return the shiny new ERRPTR where
      useful.
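      
      The interface difference in brief (a usage sketch; 'adapter' and
      'board_info' are illustrative, and the deprecated i2c_new_device()
      returned NULL on failure while the new call returns an ERR_PTR):
      
              struct i2c_client *client;
      
              client = i2c_new_client_device(adapter, &board_info);
              if (IS_ERR(client))
                      return PTR_ERR(client); /* was: if (!client) ... */
      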
      Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
      Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
      Signed-off-by: Wolfram Sang <wsa@kernel.org>
    • maccess: make get_kernel_nofault() check for minimal type compatibility · 0c389d89
      Committed by Linus Torvalds
      Now that we've renamed probe_kernel_address() to get_kernel_nofault()
      and made it look and behave more in line with get_user(), some of the
      subtle type behavior differences end up being more obvious and possibly
      dangerous.
      
      When you do
      
              get_user(val, user_ptr);
      
      the type of the access comes from the "user_ptr" part, and the above
      basically acts as
      
              val = *user_ptr;
      
      by design (except, of course, for the fact that the actual dereference
      is done with a user access).
      
      Note how in the above case, the type of the end result comes from the
      pointer argument, and then the value is cast to the type of 'val' as
      part of the assignment.
      
      So the type of the pointer is ultimately the more important type
      for the access itself.
      
      But 'get_kernel_nofault()' may now _look_ similar, but it behaves very
      differently.  When you do
      
              get_kernel_nofault(val, kernel_ptr);
      
      it behaves like
      
              val = *(typeof(val) *)kernel_ptr;
      
      except, of course, for the fact that the actual dereference is done with
      exception handling so that a faulting access is suppressed and returned
      as the error code.
      
      But note how different the casting behavior of the two superficially
      similar accesses are: one does the actual access in the size of the type
      the pointer points to, while the other does the access in the size of
      the target, and ignores the pointer type entirely.
      
      Actually changing get_kernel_nofault() to act like get_user() is almost
      certainly the right thing to do eventually, but in the meantime this
      patch adds logic to at least verify that the pointer type is compatible
      with the type of the result.
      
      In many cases, this involves just casting the pointer to 'void *' to
      make it obvious that the type of the pointer is not the important part.
      It's not how 'get_user()' acts, but at least the behavioral difference
      is now obvious and explicit.
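      
      The shape of the added check (essentially as in the patch; assigning
      the pointer to a typed variable makes the compiler warn on
      incompatible types):
      
              #define get_kernel_nofault(val, ptr) ({                         \
                      const typeof(val) *__gk_ptr = (ptr);                    \
                      copy_from_kernel_nofault(&(val), __gk_ptr, sizeof(val));\
              })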
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maccess: rename probe_kernel_address to get_kernel_nofault · 25f12ae4
      Committed by Christoph Hellwig
      Better describe what this helper does, and match the naming of
      copy_from_kernel_nofault.
      
      Also switch the argument order around, so that it acts and looks
      like get_user().
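      
      A usage sketch (the 'addr' variable is illustrative):
      
              unsigned long word;
      
              /* Argument order now matches get_user(dest, src_ptr);
               * returns -EFAULT if the address cannot be read. */
              if (get_kernel_nofault(word, (unsigned long *)addr))
                      pr_warn("cannot read %px\n", (void *)addr);
      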
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 18 June 2020, 3 commits
  7. 17 June 2020, 4 commits
    • efi/x86: Setup stack correctly for efi_pe_entry · 41d90b0c
      Committed by Arvind Sankar
      Commit
      
        17054f49 ("efi/x86: Implement mixed mode boot without the handover protocol")
      
      introduced a new entry point for the EFI stub to be booted in mixed mode
      on 32-bit firmware.
      
      When entered via efi32_pe_entry, control is first transferred to
      startup_32 to setup for the switch to long mode, and then the EFI stub
      proper is entered via efi_pe_entry. efi_pe_entry is an MS ABI function,
      and the ABI requires 32 bytes of shadow stack space to be allocated by
      the caller, as well as the stack being aligned to 8 mod 16 on entry.
      
      Allocate 40 bytes on the stack before switching to 64-bit mode when
      calling efi_pe_entry to account for this: 32 bytes of shadow space plus
      8 bytes to produce the required 8 mod 16 alignment.
      
      For robustness, explicitly align boot_stack_end to 16 bytes. It is
      currently implicitly aligned since .bss is cacheline-size aligned,
      head_64.o is the first object file with a .bss section, and the heap and
      boot sizes are aligned.
      
      Fixes: 17054f49 ("efi/x86: Implement mixed mode boot without the handover protocol")
      Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
      Link: https://lore.kernel.org/r/20200617131957.2507632-1-nivedita@alum.mit.edu
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
    • x86/resctrl: Fix a NULL vs IS_ERR() static checker warning in rdt_cdp_peer_get() · cc5277fe
      Committed by Dan Carpenter
      The callers don't expect *d_cdp to be set to an error pointer, they only
      check for NULL.  This leads to a static checker warning:
      
        arch/x86/kernel/cpu/resctrl/rdtgroup.c:2648 __init_one_rdt_domain()
        warn: 'd_cdp' could be an error pointer
      
      This would not trigger a bug in this specific case because
      __init_one_rdt_domain() calls it with a valid domain that would not
      have a negative id and thus would not trigger the return of the
      ERR_PTR(). If this
      was a negative domain id then the call to rdt_find_domain() in
      domain_add_cpu() would have returned the ERR_PTR() much earlier and the
      creation of the domain with an invalid id would have been prevented.
      
      Even though a bug is not triggered currently, the right and safe thing
      to do is to set the pointer to NULL, because that is what the caller
      checks for when handling the CDP and non-CDP cases.
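      
      The convention the fix enforces, as a minimal sketch (hypothetical
      names, not the resctrl code itself):
      
              /* Out-parameters that callers test against NULL must never
               * carry ERR_PTR values. */
              static int peer_get(struct rdt_domain **d_out)
              {
                      struct rdt_domain *d = lookup_domain();
      
                      if (IS_ERR_OR_NULL(d)) {
                              *d_out = NULL;  /* not ERR_PTR(...) */
                              return -ENOENT;
                      }
                      *d_out = d;
                      return 0;
              }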
      
      Fixes: 52eb7433 ("x86/resctrl: Fix rdt_find_domain() return value and checks")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Reinette Chatre <reinette.chatre@intel.com>
      Acked-by: Fenghua Yu <fenghua.yu@intel.com>
      Link: https://lkml.kernel.org/r/20200602193611.GA190851@mwanda
    • kretprobe: Prevent triggering kretprobe from within kprobe_flush_task · 9b38cc70
      Committed by Jiri Olsa
      Ziqian reported a lockup when adding a retprobe on _raw_spin_lock_irqsave.
      My test was also able to trigger lockdep output:
      
       ============================================
       WARNING: possible recursive locking detected
       5.6.0-rc6+ #6 Not tainted
       --------------------------------------------
       sched-messaging/2767 is trying to acquire lock:
       ffffffff9a492798 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_hash_lock+0x52/0xa0
      
       but task is already holding lock:
       ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(&(kretprobe_table_locks[i].lock));
         lock(&(kretprobe_table_locks[i].lock));
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       1 lock held by sched-messaging/2767:
        #0: ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50
      
       stack backtrace:
       CPU: 3 PID: 2767 Comm: sched-messaging Not tainted 5.6.0-rc6+ #6
       Call Trace:
        dump_stack+0x96/0xe0
        __lock_acquire.cold.57+0x173/0x2b7
        ? native_queued_spin_lock_slowpath+0x42b/0x9e0
        ? lockdep_hardirqs_on+0x590/0x590
        ? __lock_acquire+0xf63/0x4030
        lock_acquire+0x15a/0x3d0
        ? kretprobe_hash_lock+0x52/0xa0
        _raw_spin_lock_irqsave+0x36/0x70
        ? kretprobe_hash_lock+0x52/0xa0
        kretprobe_hash_lock+0x52/0xa0
        trampoline_handler+0xf8/0x940
        ? kprobe_fault_handler+0x380/0x380
        ? find_held_lock+0x3a/0x1c0
        kretprobe_trampoline+0x25/0x50
        ? lock_acquired+0x392/0xbc0
        ? _raw_spin_lock_irqsave+0x50/0x70
        ? __get_valid_kprobe+0x1f0/0x1f0
        ? _raw_spin_unlock_irqrestore+0x3b/0x40
        ? finish_task_switch+0x4b9/0x6d0
        ? __switch_to_asm+0x34/0x70
        ? __switch_to_asm+0x40/0x70
      
      The code within the kretprobe handler checks for probe reentrancy,
      so we won't trigger any _raw_spin_lock_irqsave probe in there.
      
      The problem is outside the handler, in kprobe_flush_task, where we call:
      
        kprobe_flush_task
          kretprobe_table_lock
            raw_spin_lock_irqsave
              _raw_spin_lock_irqsave
      
      where _raw_spin_lock_irqsave triggers the kretprobe and installs
      kretprobe_trampoline handler on _raw_spin_lock_irqsave return.
      
      The kretprobe_trampoline handler is then executed with
      kretprobe_table_locks already locked, and the first thing it does is
      lock kretprobe_table_locks again ;-) The whole lockup path looks like:
      
        kprobe_flush_task
          kretprobe_table_lock
            raw_spin_lock_irqsave
              _raw_spin_lock_irqsave ---> probe triggered, kretprobe_trampoline installed
      
              ---> kretprobe_table_locks locked
      
              kretprobe_trampoline
                trampoline_handler
                  kretprobe_hash_lock(current, &head, &flags);  <--- deadlock
      
      Add kprobe_busy_begin/end helpers that mark code with a fake probe
      installed, preventing another kprobe from triggering within that code.
      
      Use these helpers in kprobe_flush_task, so the probe recursion
      protection check is hit and the probe is never installed, preventing
      the lockup above.
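      
      A sketch of the helpers' effect (simplified from the kprobes core;
      the real helpers also set the per-CPU kprobe status):
      
              /* Pretend a probe is already running on this CPU so the
               * reentrancy check in the kprobe handlers bails out. */
              void kprobe_busy_begin(void)
              {
                      preempt_disable();
                      __this_cpu_write(current_kprobe, &kprobe_busy);
              }
      
              void kprobe_busy_end(void)
              {
                      __this_cpu_write(current_kprobe, NULL);
                      preempt_enable();
              }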
      
      Link: http://lkml.kernel.org/r/158927059835.27680.7011202830041561604.stgit@devnote2
      
      Fixes: ef53d9c5 ("kprobes: improve kretprobe scalability with hashed locking")
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: "Gustavo A . R . Silva" <gustavoars@kernel.org>
      Cc: Anders Roxell <anders.roxell@linaro.org>
      Cc: "Naveen N . Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Reported-by: N"Ziqian SUN (Zamir)" <zsun@redhat.com>
      Acked-by: NMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      9b38cc70
    • x86/purgatory: Add -fno-stack-protector · ff58155c
      Committed by Arvind Sankar
      The purgatory Makefile removes -fstack-protector options if they were
      configured in, but does not currently add -fno-stack-protector.
      
      If gcc was configured with the --enable-default-ssp configure option,
      this results in the stack protector still being enabled for the
      purgatory (absent distro-specific specs files that might disable it
      again for freestanding compilations), if the main kernel is being
      compiled with stack protection enabled (if it's disabled for the main
      kernel, the top-level Makefile will add -fno-stack-protector).
      
      This will break the build since commit
        e4160b2e ("x86/purgatory: Fail the build if purgatory.ro has missing symbols")
      and prior to that would have caused a runtime failure when trying to
      use kexec.
      
      Explicitly add -fno-stack-protector to avoid this, as done in other
      Makefiles that need to disable the stack protector.
      Reported-by: Gabriel C <nix.or.die@googlemail.com>
      Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 16 June 2020, 1 commit
  9. 15 June 2020, 6 commits
  10. 14 June 2020, 1 commit
    • treewide: replace '---help---' in Kconfig files with 'help' · a7f7f624
      Committed by Masahiro Yamada
      Since commit 84af7a61 ("checkpatch: kconfig: prefer 'help' over
      '---help---'"), the number of '---help---' has been gradually
      decreasing, but there are still more than 2400 instances.
      
      This commit finishes the conversion. While touching the lines,
      I also fixed the indentation.
      
      A variety of indentation styles were found:
      
        a) 4 spaces + '---help---'
        b) 7 spaces + '---help---'
        c) 8 spaces + '---help---'
        d) 1 space + 1 tab + '---help---'
        e) 1 tab + '---help---'    (correct indentation)
        f) 1 tab + 1 space + '---help---'
        g) 1 tab + 2 spaces + '---help---'
      
      In order to convert all of them to 1 tab + 'help', I ran the
      following command:
      
        $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
  11. 13 June 2020, 1 commit
    • x86/entry: Force rcu_irq_enter() when in idle task · 0bf3924b
      Committed by Thomas Gleixner
      The idea of conditionally calling into rcu_irq_enter() only when RCU is
      not watching turned out to be not completely thought through.
      
      Paul noticed occasional premature end of grace periods in RCU torture
      testing. Bisection led to the commit which made the invocation of
      rcu_irq_enter() conditional on !rcu_is_watching().
      
      It turned out that this conditional breaks RCU assumptions about the idle
      task when the scheduler tick happens to be a nested interrupt. Nested
      interrupts can happen when the first interrupt invokes softirq processing
      on return, which enables interrupts.
      
      If that nested tick interrupt does not invoke rcu_irq_enter() then the
      RCU's irq-nesting checks will believe that this interrupt came directly
      from idle, which will cause RCU to report a quiescent state.  Because this
      interrupt instead came from a softirq handler which might have been
      executing an RCU read-side critical section, this can cause the grace
      period to end prematurely.
      
      Change the condition from !rcu_is_watching() to is_idle_task(current) which
      enforces that interrupts in the idle task unconditionally invoke
      rcu_irq_enter() independent of the RCU state.
      
      This is also correct vs. user mode entries in NOHZ full scenarios,
      because user mode entries bring RCU out of EQS and force the RCU
      irq-nesting state accounting to nested. As only the first interrupt can
      enter from user mode, a nested tick interrupt will enter from kernel
      mode, and since the nesting state accounting is forced to nested, it
      will not do anything stupid even if rcu_irq_enter() has not been
      invoked.
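      
      A simplified sketch of the conditional change in the entry code (not
      the exact upstream hunk):
      
              /* Before: keyed off the RCU state; misses the nested-tick
               * case described above. */
              if (!rcu_is_watching())
                      rcu_irq_enter();
      
              /* After: interrupts in the idle task always inform RCU,
               * however deeply they nest. */
              if (is_idle_task(current))
                      rcu_irq_enter();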
      
      Fixes: 3eeec385 ("x86/entry: Provide idtentry_entry/exit_cond_rcu()")
      Reported-by: N"Paul E. McKenney" <paulmck@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: N"Paul E. McKenney" <paulmck@kernel.org>
      Reviewed-by: N"Paul E. McKenney" <paulmck@kernel.org>
      Acked-by: NAndy Lutomirski <luto@kernel.org>
      Acked-by: NFrederic Weisbecker <frederic@kernel.org>
      Link: https://lkml.kernel.org/r/87wo4cxubv.fsf@nanos.tec.linutronix.de
      0bf3924b
  12. 12 June 2020, 6 commits
  13. 11 June 2020, 1 commit
    • KVM: nVMX: Consult only the "basic" exit reason when routing nested exit · 2ebac8bb
      Committed by Sean Christopherson
      Consult only the basic exit reason, i.e. bits 15:0 of vmcs.EXIT_REASON,
      when determining whether a nested VM-Exit should be reflected into L1 or
      handled by KVM in L0.
      
      For better or worse, the switch statement in nested_vmx_exit_reflected()
      currently defaults to "true", i.e. reflects any nested VM-Exit without
      dedicated logic.  Because the case statements only contain the basic
      exit reason, any VM-Exit with modifier bits set will be reflected to L1,
      even if KVM intended to handle it in L0.
      
      Practically speaking, this only affects EXIT_REASON_MCE_DURING_VMENTRY,
      i.e. a #MC that occurs on nested VM-Enter would be incorrectly routed to
      L1, as "failed VM-Entry" is the only modifier that KVM can currently
      encounter.  The SMM modifiers will never be generated as KVM doesn't
      support/employ a SMI Transfer Monitor.  Ditto for "exit from enclave",
      as KVM doesn't yet support virtualizing SGX, i.e. it's impossible to
      enter an enclave in a KVM guest (L1 or L2).
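      
      A minimal sketch of the routing change (simplified; constant names as
      in arch/x86/include/uapi/asm/vmx.h):
      
              /* Bits 15:0 of vmcs.EXIT_REASON are the basic exit reason;
               * high bits (e.g. bit 31, "failed VM-Entry") are modifiers. */
              u32 basic_reason = exit_reason & 0xffff;
      
              switch (basic_reason) {         /* was: switch (exit_reason) */
              case EXIT_REASON_MCE_DURING_VMENTRY:
                      return false;           /* handle the #MC in L0 */
              /* ... */
              default:
                      return true;            /* reflect to L1 */
              }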
      
      Fixes: 644d711a ("KVM: nVMX: Deciding if L0 or L1 should handle an L2 exit")
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Xiaoyao Li <xiaoyao.li@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200227174430.26371-1-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>