1. 19 May 2016, 1 commit
  2. 13 May 2016, 1 commit
    • KVM: halt_polling: provide a way to qualify wakeups during poll · 3491caf2
      Committed by Christian Borntraeger
      Some wakeups should not be considered a successful poll. For example, on
      s390, I/O interrupts are usually floating, which means that _ALL_ CPUs
      would be considered runnable - letting all vCPUs poll all the time for
      transaction-like workloads, even if one vCPU would be enough.
      This can result in huge CPU usage for large guests.
      This patch lets architectures provide a way to qualify wakeups as
      good or bad with regard to polling.

      For s390, the implementation will fence off halt polling for anything but
      known good, single-vCPU events. The s390 implementation for floating
      interrupts does a wakeup for one vCPU, but the interrupt will be delivered
      by whatever CPU checks first for a pending interrupt. We prefer the
      woken-up CPU by marking its poll as a "good" poll.
      This code will also mark several other wakeup reasons, such as IPIs or
      expired timers, as "good". It will of course also mark some events as
      not successful. As KVM on z always runs as a second-level hypervisor,
      we prefer not to poll unless we are really sure, though.

      This patch successfully limits the CPU usage for cases like a uperf 1-byte
      transactional ping-pong workload or wakeup-heavy workloads like OLTP,
      while still providing a proper speedup.

      This also introduces a new vcpu stat, "halt_poll_no_tuning", that marks
      wakeups that are considered not good for polling.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
      Cc: David Matlack <dmatlack@google.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      [Rename config symbol. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3491caf2
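
      As a rough illustration of the wakeup-qualification idea in the commit
      above, here is a minimal user-space sketch. It only models the decision
      "count this wakeup toward successful-poll statistics (and grow the poll
      window) only if the architecture flagged it as valid"; the field and
      function names and the grow/shrink factors are illustrative assumptions,
      not the actual KVM symbols.

        /* Conceptual sketch only -- not the kernel implementation. */
        #include <stdbool.h>
        #include <stdio.h>

        struct vcpu_model {
            bool valid_wakeup;         /* set by the "architecture" for good wakeups */
            unsigned int halt_poll_ns; /* current per-vCPU poll window               */
            unsigned int poll_success;
            unsigned int poll_invalid;
        };

        static void account_wakeup(struct vcpu_model *v)
        {
            if (v->valid_wakeup) {
                v->poll_success++;
                v->halt_poll_ns *= 2;      /* grow: polling paid off          */
            } else {
                v->poll_invalid++;         /* e.g. a floating I/O interrupt   */
                v->halt_poll_ns /= 2;      /* do not reward the poll          */
            }
        }

        int main(void)
        {
            struct vcpu_model v = { .halt_poll_ns = 10000 };

            v.valid_wakeup = true;  account_wakeup(&v); /* IPI / timer: good      */
            v.valid_wakeup = false; account_wakeup(&v); /* floating IRQ: not good */

            printf("good=%u invalid=%u window=%u ns\n",
                   v.poll_success, v.poll_invalid, v.halt_poll_ns);
            return 0;
        }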
  3. 20 April 2016, 1 commit
  4. 08 April 2016, 1 commit
  5. 02 April 2016, 1 commit
    • mm/rmap: batched invalidations should use existing api · 858eaaa7
      Committed by Nadav Amit
      The recently introduced batched invalidation mechanism uses its own
      code for the TLB shootdown.  However, it gets the accounting of
      interrupts wrong (e.g., inc_irq_stat is called for local invalidations),
      likewise the trace points (e.g., TLB_REMOTE_SHOOTDOWN for local
      invalidations), and it may break some platforms because it bypasses the
      invalidation mechanisms of Xen and SGI UV.

      This patch reuses the existing TLB flushing mechanisms instead.  We use
      NULL as the mm to indicate that a global invalidation is required.

      Fixes: 72b252ae ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages")
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      858eaaa7
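
      A hedged, user-space sketch of the "NULL mm means global invalidation"
      convention mentioned above; the struct and the function name are
      illustrative stand-ins, not the kernel's flush API.

        #include <stdio.h>

        struct mm_struct { int id; };   /* stand-in for the kernel's mm_struct */

        /* NULL mm requests a global TLB invalidation; a non-NULL mm requests a
         * flush scoped to that address space (illustrative only). */
        static void flush_tlb_demo(struct mm_struct *mm)
        {
            if (!mm) {
                printf("global invalidation (all address spaces)\n");
                return;
            }
            printf("flush limited to mm %d\n", mm->id);
        }

        int main(void)
        {
            struct mm_struct mm = { .id = 42 };

            flush_tlb_demo(&mm);   /* targeted flush      */
            flush_tlb_demo(NULL);  /* batched/global case */
            return 0;
        }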
  6. 01 April 2016, 1 commit
    • KVM: x86: reduce default value of halt_poll_ns parameter · 14ebda33
      Committed by Paolo Bonzini
      Windows lets applications choose the frequency of the timer tick,
      and in Windows 10 the maximum rate was changed from 1024 Hz to
      2048 Hz.  Unfortunately, because of the way the Windows API
      works, most applications that need a higher rate than the default
      64 Hz will just do
      
         timeGetDevCaps(&tc, sizeof(tc));
         timeBeginPeriod(tc.wPeriodMin);
      
      and pick the maximum rate.  This causes very high CPU usage when
      playing media or games on Windows 10, even if the guest does not
      actually use the CPU very much, because the frequent timer tick
      causes halt_poll_ns to kick in.
      
      There is no really good solution, especially because Microsoft
      could sooner or later bump the limit to 4096 Hz, but for now
      the best we can do is to lower the upper limit for halt_poll_ns
      a bit. :-(
      Reported-by: Jon Panozzo <jonp@lime-technology.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      14ebda33
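
      For reference, the knob this commit tunes is a module parameter of kvm.ko
      and can be inspected at run time. A small sketch is below; it assumes the
      kvm module is loaded and exposes halt_poll_ns through the standard
      /sys/module/<name>/parameters/<param> path, and it makes no claim about
      what the default value is on any particular kernel.

        #include <stdio.h>

        int main(void)
        {
            FILE *f = fopen("/sys/module/kvm/parameters/halt_poll_ns", "r");
            unsigned long ns;

            if (!f) {
                perror("open halt_poll_ns");
                return 1;
            }
            if (fscanf(f, "%lu", &ns) != 1) {
                fclose(f);
                return 1;
            }
            printf("halt_poll_ns = %lu ns\n", ns);
            fclose(f);
            return 0;
        }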
  7. 29 March 2016, 5 commits
  8. 23 March 2016, 3 commits
  9. 22 March 2016, 4 commits
  10. 21 March 2016, 2 commits
    • x86/cpufeature, perf/x86: Add AMD Accumulated Power Mechanism feature flag · 01fe03ff
      Committed by Huang Rui
      AMD CPU family 15h model 0x60 introduces a mechanism for measuring
      accumulated power. It is used to report the processor power consumption
      and support for it is indicated by CPUID Fn8000_0007_EDX[12].
      Signed-off-by: Huang Rui <ray.huang@amd.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andreas Herrmann <herrmann.der.user@googlemail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Jacob Shin <jacob.w.shin@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Kristen Carlson Accardi <kristen@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Wan Zongshun <Vincent.Wan@amd.com>
      Cc: spg_linux_kernel@amd.com
      Link: http://lkml.kernel.org/r/1452739808-11871-4-git-send-email-ray.huang@amd.com
      [ Resolved conflict and moved the synthetic CPUID slot to 19. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      01fe03ff
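
      The commit only adds the feature flag, but the CPUID bit it refers to
      (Fn8000_0007 EDX[12]) can be probed directly from user space; a minimal
      sketch is below. It only tests the enumeration bit and does not read any
      power counters.

        #include <cpuid.h>
        #include <stdio.h>

        int main(void)
        {
            unsigned int eax, ebx, ecx, edx;

            if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx)) {
                puts("CPUID leaf 0x80000007 not available");
                return 1;
            }
            /* EDX bit 12: Accumulated Power Mechanism (per the commit above). */
            printf("Accumulated Power Mechanism: %s\n",
                   (edx & (1u << 12)) ? "supported" : "not supported");
            return 0;
        }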
    • perf/x86/mbm: Add Intel Memory B/W Monitoring enumeration and init · 33c3cc7a
      Committed by Vikas Shivappa
      The MBM init patch enumerates Intel MBM (memory bandwidth monitoring)
      support and initializes the perf events and data structures for
      monitoring memory bandwidth.

      It is based on the original patch series by Tony Luck and Kanaka Juvva.

      Memory bandwidth monitoring (MBM) provides the OS/VMM with a way to
      monitor bandwidth from one level of cache to another. The current patches
      support L3 external bandwidth monitoring. It supports both 'local
      bandwidth' and 'total bandwidth' monitoring for the socket. Local
      bandwidth measures the amount of data sent through the memory controller
      on the socket, and total bandwidth measures the total system bandwidth.

      Extending the cache quality of service monitoring (CQM) code, we add two
      more events to the perf infrastructure:

        intel_cqm_llc/local_bytes - bytes sent through the local socket memory controller
        intel_cqm_llc/total_bytes - total L3 external bytes sent

      Tasks are associated with a Resource Monitoring ID (RMID), just like in
      CQM, and the OS uses an MSR write to indicate the RMID of the task during
      scheduling.
      Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Tony Luck <tony.luck@intel.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fenghua.yu@intel.com
      Cc: h.peter.anvin@intel.com
      Cc: ravi.v.shankar@intel.com
      Cc: vikas.shivappa@intel.com
      Link: http://lkml.kernel.org/r/1457652732-4499-4-git-send-email-vikas.shivappa@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      33c3cc7a
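
      The two events above are exposed through perf's dynamic-PMU sysfs
      interface. The sketch below only resolves a dynamic PMU's numeric type id
      from /sys/bus/event_source/devices/<pmu>/type; the directory name
      "intel_cqm" is an assumption inferred from the event prefix quoted above,
      and actually opening the events would additionally require their config
      encodings, which are not restated here.

        #include <stdio.h>

        /* Read the perf type id of a dynamically registered PMU from sysfs. */
        static int pmu_type(const char *pmu)
        {
            char path[256];
            int type = -1;
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/bus/event_source/devices/%s/type", pmu);
            f = fopen(path, "r");
            if (!f)
                return -1;
            if (fscanf(f, "%d", &type) != 1)
                type = -1;
            fclose(f);
            return type;
        }

        int main(void)
        {
            int type = pmu_type("intel_cqm");   /* assumed PMU name */

            if (type < 0)
                puts("intel_cqm PMU not present (or named differently)");
            else
                printf("intel_cqm PMU type id: %d\n", type);
            return 0;
        }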
  11. 18 March 2016, 2 commits
    • x86/irq: Cure live lock in fixup_irqs() · 551adc60
      Committed by Thomas Gleixner
      Harry reported that he is able to trigger a system freeze with CPU hot
      unplug. The freeze turned out to be a live lock caused by recent changes in
      irq_force_complete_move().

      When fixup_irqs(), and from there irq_force_complete_move(), is called on the
      dying CPU, all other CPUs are in stop machine and wait for the dying CPU
      to complete the teardown. If there is a pending move of an interrupt, then
      irq_force_complete_move() sends the cleanup IPI to the CPUs in the old_domain
      mask and waits for them to clear the mask. That's obviously impossible, as
      those CPUs are firmly stuck in stop machine with interrupts disabled.

      I should have known that, but I completely overlooked it, being concentrated on
      the locking issues around the vectors. And the existence of the call to
      __irq_complete_move() in the code, which actually sends the cleanup IPI, made
      it seem reasonable to wait for that cleanup to complete. That call was bogus even
      before the recent changes, as it was just a pointless distraction.

      We have to look at two cases:

      1) The move_in_progress flag of the interrupt is set

         This means the ioapic has been updated with the new vector, but it has not
         fired yet. In theory there is a race:

         set_ioapic(new_vector) <-- Interrupt is raised before update is effective,
                                    i.e. it's raised on the old vector.

         So if the target CPU cannot handle that interrupt before the old vector is
         cleaned up, we get a spurious interrupt and in the worst case the ioapic
         irq line becomes stale, but my experiments so far have only resulted in
         spurious interrupts.

         But in the case of CPU hotplug this should be a non-issue, because if the
         affinity update happens right before all CPUs rendezvous in stop machine,
         there is no way that the interrupt can be blocked on the target CPU, because
         all CPUs first loop with interrupts enabled in stop machine, so the old
         vector is not yet cleaned up when the interrupt fires.

         So the only way to run into this issue is if the delivery of the interrupt
         on the apic/system bus were delayed beyond the point where the target
         CPU disables interrupts in stop machine. I doubt that this can happen, but at
         least there is a theoretical chance. Virtualization might be able to
         expose this, but AFAICT the IOAPIC emulation is not as stupid as the real
         hardware.

         I've spent quite some time over the weekend trying to force that situation,
         though I was not able to trigger the delayed case.

      2) The move_in_progress flag is not set and the old_domain cpu mask is not
         empty.

         That means that an interrupt was delivered after the change and the
         cleanup IPI has been sent to the CPUs in old_domain, but not all CPUs have
         responded to it yet.

      In both cases we can assume that the next interrupt will arrive on the new
      vector, so we can clean up the old vectors on the CPUs in the old_domain cpu
      mask.
      
      Fixes: 98229aa3 "x86/irq: Plug vector cleanup race"
      Reported-by: Harry Junior <harryjr@outlook.fr>
      Tested-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Joe Lawrence <joe.lawrence@stratus.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603140931430.3657@nanos
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      551adc60
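
      A heavily simplified, user-space model of the resulting cleanup rule: the
      dying CPU releases the stale vectors on behalf of every CPU still listed
      in old_domain, instead of sending a cleanup IPI and waiting on CPUs that
      are parked in stop machine. All names and data structures below are
      illustrative; the real logic lives in the x86 vector management code.

        #include <stdbool.h>
        #include <stdio.h>

        #define NR_CPUS 8

        static int  per_cpu_vector[NR_CPUS]; /* which irq owns the old vector (model) */
        static bool old_domain[NR_CPUS];     /* CPUs that still hold it               */

        /* Fixed approach (model): release the old vector directly for the CPUs in
         * old_domain -- the next interrupt is guaranteed to arrive on the new
         * vector, so no IPI round-trip is needed. */
        static void force_complete_move(int irq)
        {
            for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                if (old_domain[cpu] && per_cpu_vector[cpu] == irq) {
                    per_cpu_vector[cpu] = -1;   /* mark the vector unused */
                    old_domain[cpu] = false;
                }
            }
        }

        int main(void)
        {
            per_cpu_vector[2] = 10; old_domain[2] = true;
            per_cpu_vector[5] = 10; old_domain[5] = true;

            force_complete_move(10);
            printf("cpu2 owner: %d, cpu5 owner: %d\n",
                   per_cpu_vector[2], per_cpu_vector[5]);
            return 0;
        }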
    • x86/apic: Fix suspicious RCU usage in smp_trace_call_function_interrupt() · 7834c103
      Committed by Dave Jones
      Since 4.4, I've been able to trigger this occasionally:
      
      ===============================
      [ INFO: suspicious RCU usage. ]
      4.5.0-rc7-think+ #3 Not tainted
      -------------------------------
      ./arch/x86/include/asm/msr-trace.h:47 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      RCU used illegally from idle CPU!
      rcu_scheduler_active = 1, debug_locks = 1
      RCU used illegally from extended quiescent state!
      no locks held by swapper/3/0.
      
      stack backtrace:
      CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.5.0-rc7-think+ #3
       ffffffff92f821e0 1f3e5c340597d7fc ffff880468e07f10 ffffffff92560c2a
       ffff880462145280 0000000000000001 ffff880468e07f40 ffffffff921376a6
       ffffffff93665ea0 0000cc7c876d28da 0000000000000005 ffffffff9383dd60
      Call Trace:
       <IRQ>  [<ffffffff92560c2a>] dump_stack+0x67/0x9d
       [<ffffffff921376a6>] lockdep_rcu_suspicious+0xe6/0x100
       [<ffffffff925ae7a7>] do_trace_write_msr+0x127/0x1a0
       [<ffffffff92061c83>] native_apic_msr_eoi_write+0x23/0x30
       [<ffffffff92054408>] smp_trace_call_function_interrupt+0x38/0x360
       [<ffffffff92d1ca60>] trace_call_function_interrupt+0x90/0xa0
       <EOI>  [<ffffffff92ac5124>] ? cpuidle_enter_state+0x1b4/0x520
      
      Move the entering_irq() call before ack_APIC_irq(), because entering_irq()
      tells the RCU subsystem to end the extended quiescent state, so that the
      following trace call in ack_APIC_irq() works correctly.
      Suggested-by: Andi Kleen <ak@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Fixes: 4787c368 "x86/tracing: Add irq_enter/exit() in smp_trace_reschedule_interrupt()"
      Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20160315012054.GA17765@codemonkey.org.uk
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      7834c103
  12. 17 March 2016, 1 commit
    • x86/iopl/64: Properly context-switch IOPL on Xen PV · b7a58459
      Committed by Andy Lutomirski
      On Xen PV, regs->flags doesn't reliably reflect IOPL and the
      exit-to-userspace code doesn't change IOPL.  We need to context
      switch it manually.
      
      I'm doing this without going through paravirt because this is
      specific to Xen PV.  After the dust settles, we can merge this with
      the 32-bit code, tidy up the iopl syscall implementation, and remove
      the set_iopl pvop entirely.
      
      Fixes XSA-171.
      Reviewed-by: Jan Beulich <JBeulich@suse.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/693c3bd7aeb4d3c27c92c622b7d0f554a458173c.1458162709.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b7a58459
  13. 16 March 2016, 1 commit
    • x86/mm, x86/mce: Fix return type/value for memcpy_mcsafe() · cbf8b5a2
      Committed by Tony Luck
      Returning a 'bool' was very unpopular. Doubly so because the
      code was just wrong (returning zero for true, one for false;
      great for shell programming, not so good for C).
      
      Change return type to "int". Keep zero as the success indicator
      because it matches other similar code and people may be more
      comfortable writing:
      
      	if (memcpy_mcsafe(to, from, count)) {
      		printk("Sad panda, copy failed\n");
      		...
      	}
      
      Make the failure return value -EFAULT for now.
      
      Reported-by: Mika Penttilä <mika.penttila@nextfour.com>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: mika.penttila@nextfour.com
      Fixes: 92b0729c ("x86/mm, x86/mce: Add memcpy_mcsafe()")
      Link: http://lkml.kernel.org/r/695f14233fa7a54fcac4406c706d7fec228e3f4c.1457993040.git.tony.luck@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cbf8b5a2
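
      A user-space stand-in illustrating the calling convention after this
      change (int return, 0 on success, -EFAULT on failure). The fault path is
      simulated; the real memcpy_mcsafe() is a kernel routine that survives
      machine checks while reading the source, which plain memcpy cannot
      emulate here.

        #include <errno.h>
        #include <stdio.h>
        #include <string.h>

        /* Stand-in with the post-change convention: 0 == success, -EFAULT == fault. */
        static int memcpy_mcsafe_demo(void *dst, const void *src, size_t cnt)
        {
            if (!src || !dst)
                return -EFAULT;        /* simulated failure */
            memcpy(dst, src, cnt);
            return 0;
        }

        int main(void)
        {
            char buf[16];

            if (memcpy_mcsafe_demo(buf, "hello", 6))
                printf("Sad panda, copy failed\n");
            else
                printf("copied: %s\n", buf);
            return 0;
        }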
  14. 14 March 2016, 2 commits
    • ipv6: Pass proto to csum_ipv6_magic as __u8 instead of unsigned short · 1e940829
      Committed by Alexander Duyck
      This patch updates csum_ipv6_magic so that it correctly recognizes that
      the protocol is an unsigned 8-bit value.

      This will allow us to better understand what limitations may or may not be
      present in how we handle the data.  For example, there are a number of
      places that call htonl on the protocol value.  This is likely not necessary
      and can be replaced with a multiplication by ntohl(1), which will be
      converted to a shift by the compiler.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1e940829
    • ipv4: Update parameters for csum_tcpudp_magic to their original types · 01cfbad7
      Committed by Alexander Duyck
      This patch updates all instances of csum_tcpudp_magic and
      csum_tcpudp_nofold to reflect the types that are usually used as the source
      inputs.  For example, the protocol field is populated based on nexthdr,
      which is actually an unsigned 8-bit value.  The length is usually populated
      based on skb->len, which is an unsigned integer.

      This addresses an issue in which the IPv6 function csum_ipv6_magic was
      generating a checksum using the full 32 bits of skb->len while
      csum_tcpudp_magic was only using the lower 16 bits.  As a result we could
      run into issues when attempting to adjust the checksum, as there was no
      protocol-agnostic way to update it.

      With this change the value is still truncated, as many architectures use
      "(len + proto) << 8"; however, this truncation only occurs for values
      greater than 16776960 in length, and as such it is unlikely to occur, as
      we stop the inner headers at ~64K in size.

      I did have to make a few minor changes in the arm, mn10300, nios2, and
      score versions of the function in order to support these changes, as they
      were either using things such as an OR to combine the protocol and length,
      or were using ntohs to convert the length, which would have truncated the
      value.

      I also updated a few spots in terms of whitespace and type differences for
      the addresses.  Most of this was just to make sure all of the definitions
      were in sync going forward.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      01cfbad7
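
      A small sketch of the arithmetic referenced above: folding
      "(len + proto) << 8" into a 32-bit accumulator only loses information once
      len + proto reaches 2^24, which is where the 16776960-byte threshold in
      the message comes from (16776960 = 2^24 - 256). This is just the
      arithmetic, not the real csum_tcpudp_nofold implementation.

        #include <stdint.h>
        #include <stdio.h>

        /* Returns 1 when the 32-bit "(len + proto) << 8" fold drops bits. */
        static int shift_loses_bits(uint32_t len, uint8_t proto)
        {
            uint64_t exact  = ((uint64_t)len + proto) << 8;
            uint32_t folded = (len + proto) << 8;

            return exact != folded;
        }

        int main(void)
        {
            printf("len=65535,    proto=6:   lossy=%d\n", shift_loses_bits(65535, 6));
            printf("len=16776960, proto=255: lossy=%d\n", shift_loses_bits(16776960, 255));
            printf("len=16776961, proto=255: lossy=%d\n", shift_loses_bits(16776961, 255));
            return 0;
        }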
  15. 13 March 2016, 1 commit
    • x86/cpufeature: Enable new AVX-512 features · d0500494
      Committed by Fenghua Yu
      A few new AVX-512 instruction groups/features are added to cpufeatures.h
      for enumeration: AVX512DQ, AVX512BW, and AVX512VL.
      
      Clear the flags in fpu__xstate_clear_all_cpu_caps().
      
      The specification for the latest AVX-512 extensions, including these features, can be found at:
      
        https://software.intel.com/sites/default/files/managed/07/b7/319433-023.pdf
      
      Note, I didn't enable the flags in KVM. Hopefully the KVM guys can pick up
      the flags and enable them in KVM.
      Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Ravi V Shankar <ravi.v.shankar@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kvm@vger.kernel.org
      Link: http://lkml.kernel.org/r/1457667498-37357-1-git-send-email-fenghua.yu@intel.com
      [ Added more detailed feature descriptions. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d0500494
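
      These features are reported in CPUID leaf 7 (sub-leaf 0), register EBX; a
      minimal user-space probe is sketched below. The bit positions used here
      (17 for AVX512DQ, 30 for AVX512BW, 31 for AVX512VL) follow the Intel SDM
      enumeration, but verify them against the specification linked above.

        #include <cpuid.h>
        #include <stdio.h>

        int main(void)
        {
            unsigned int eax, ebx, ecx, edx;

            if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
                puts("CPUID leaf 7 not available");
                return 1;
            }
            printf("AVX512DQ: %u\n", (ebx >> 17) & 1);
            printf("AVX512BW: %u\n", (ebx >> 30) & 1);
            printf("AVX512VL: %u\n", (ebx >> 31) & 1);
            return 0;
        }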
  16. 11 March 2016, 1 commit
  17. 10 March 2016, 7 commits
    • x86/entry: Call enter_from_user_mode() with IRQs off · 9999c8c0
      Committed by Andy Lutomirski
      Now that slow-path syscalls always enter C before enabling
      interrupts, it's straightforward to call enter_from_user_mode() before
      enabling interrupts rather than doing it as part of entry tracing.
      
      With this change, we should finally be able to retire exception_enter().
      
      This will also enable optimizations based on knowing that we never
      change context tracking state with interrupts on.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/bc376ecf87921a495e874ff98139b1ca2f5c5dd7.1457558566.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9999c8c0
    • x86/fpu: Revert ("x86/fpu: Disable AVX when eagerfpu is off") · a65050c6
      Committed by Yu-cheng Yu
      Leonid Shatz noticed that the SDM interpretation of the following
      recent commit:
      
        394db20c ("x86/fpu: Disable AVX when eagerfpu is off")
      
      ... is incorrect and that the original behavior of the FPU code was correct.
      
      Because AVX is not mentioned in the CR0.TS bit description, it was
      mistakenly believed not to be supported by lazy context switching. This
      turns out to be false:
      
        Intel Software Developer's Manual Vol. 3A, Sec. 2.5 Control Registers:
      
         'TS Task Switched bit (bit 3 of CR0) -- Allows the saving of the x87 FPU/
          MMX/SSE/SSE2/SSE3/SSSE3/SSE4 context on a task switch to be delayed until
          an x87 FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instruction is actually executed
          by the new task.'
      
        Intel Software Developer's Manual Vol. 2A, Sec. 2.4 Instruction Exception
        Specification:
      
         'AVX instructions refer to exceptions by classes that include #NM
          "Device Not Available" exception for lazy context switch.'
      
      So revert the commit.
      Reported-by: Leonid Shatz <leonid.shatz@ravellosystems.com>
      Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
      Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1457569734-3785-1-git-send-email-yu-cheng.yu@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a65050c6
    • x86/entry: Remove TIF_SINGLESTEP entry work · 392a6254
      Committed by Andy Lutomirski
      Now that SYSENTER with TF set puts X86_EFLAGS_TF directly into
      regs->flags, we don't need a TIF_SINGLESTEP fixup in the syscall
      entry code.  Remove it.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/2d15f24da52dafc9d2f0b8d76f55544f4779c517.1457578375.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      392a6254
    • x86/entry/32: Add and check a stack canary for the SYSENTER stack · 2a41aa4f
      Committed by Andy Lutomirski
      The first instruction of the SYSENTER entry runs on its own tiny
      stack.  That stack can be used if a #DB or NMI is delivered before
      the SYSENTER prologue switches to a real stack.
      
      We have code in place to prevent us from overflowing the tiny stack.
      For added paranoia, add a canary to the stack and check it in
      do_debug() -- that way, if something goes wrong with the #DB logic,
      we'll eventually notice.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/6ff9a806f39098b166dc2c41c1db744df5272f29.1457578375.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2a41aa4f
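
      A toy user-space sketch of the canary idea described above: place a known
      value next to a small stack-like buffer and verify it later. The canary
      value and layout are illustrative, not the ones used by the kernel.

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        #define CANARY 0xdeadbeefcafebabeULL   /* illustrative value */

        struct tiny_stack {
            uint64_t canary;                   /* sits below the usable area */
            char     space[64];
        };

        static void use_stack(struct tiny_stack *s, size_t n)
        {
            memset(s->space, 0xAA, n);         /* n > sizeof(space) would smash it */
        }

        static void check_canary(const struct tiny_stack *s)
        {
            if (s->canary != CANARY)
                fprintf(stderr, "canary clobbered -- stack overflowed!\n");
            else
                puts("canary intact");
        }

        int main(void)
        {
            struct tiny_stack s = { .canary = CANARY };

            use_stack(&s, sizeof(s.space));    /* stays in bounds */
            check_canary(&s);
            return 0;
        }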
    • x86/entry: Only allocate space for tss_struct::SYSENTER_stack if needed · 6dcc9414
      Committed by Andy Lutomirski
      The SYSENTER stack is only used on 32-bit kernels.  Remove it on 64-bit kernels.
      
      ( We may end up using it down the road on 64-bit kernels. If so,
        we'll re-enable it for CONFIG_IA32_EMULATION. )
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/9dbd18429f9ff61a76b6eda97a9ea20510b9f6ba.1457578375.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6dcc9414
    • x86/entry: Vastly simplify SYSENTER TF (single-step) handling · f2b37575
      Committed by Andy Lutomirski
      Due to a blatant design error, SYSENTER doesn't clear TF (single-step).
      
      As a result, if a user does SYSENTER with TF set, we will single-step
      through the kernel until something clears TF.  There is absolutely
      nothing we can do to prevent this short of turning off SYSENTER [1].
      
      Simplify the handling considerably with two changes:
      
        1. We already sanitize EFLAGS in SYSENTER to clear NT and AC.  We can
           add TF to that list of flags to sanitize with no overhead whatsoever.
      
        2. Teach do_debug() to ignore single-step traps in the SYSENTER prologue.
      
      That's all we need to do.
      
      Don't get too excited -- our handling is still buggy on 32-bit
      kernels.  There's nothing wrong with the SYSENTER code itself, but
      the #DB prologue has a clever fixup for traps on the very first
      instruction of entry_SYSENTER_32, and the fixup doesn't work quite
      correctly.  The next two patches will fix that.
      
      [1] We could probably prevent it by forcing BTF on at all times and
          making sure we clear TF before any branches in the SYSENTER
          code.  Needless to say, this is a bad idea.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/a30d2ea06fe4b621fe6a9ef911b02c0f38feb6f2.1457578375.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f2b37575
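
      Point 1 in the list above amounts to masking three bits out of the saved
      flags word. The sketch below shows just that mask on a plain variable;
      the EFLAGS positions are architectural (TF is bit 8, NT bit 14, AC bit
      18), but the real entry code applies this to the saved register frame,
      not to a C variable.

        #include <stdint.h>
        #include <stdio.h>

        #define X86_EFLAGS_TF (1u << 8)    /* trap flag (single-step) */
        #define X86_EFLAGS_NT (1u << 14)   /* nested task             */
        #define X86_EFLAGS_AC (1u << 18)   /* alignment check         */

        int main(void)
        {
            uint32_t flags = 0x00044302;   /* example: TF, NT and AC set */

            /* Sanitize the way the SYSENTER prologue does: clear NT and AC as
             * before, and now TF as well. */
            flags &= ~(X86_EFLAGS_TF | X86_EFLAGS_NT | X86_EFLAGS_AC);

            printf("sanitized flags: 0x%08x\n", flags);
            return 0;
        }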
    • libnvdimm, pmem: clear poison on write · 59e64739
      Committed by Dan Williams
      If a write is directed at a known bad block perform the following:
      
      1/ write the data
      
      2/ send a clear poison command
      
      3/ invalidate the poison out of the cache hierarchy
      
      Cc: <x86@kernel.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      59e64739
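
      A hedged sketch of the three-step write path listed above. Every helper
      name here (write_block, clear_poison, cache_invalidate, block_is_known_bad)
      is a hypothetical stand-in for illustration, not the libnvdimm API.

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdio.h>

        static void write_block(unsigned long lba, const void *data, size_t len)
        { printf("write %zu bytes at lba %lu\n", len, lba); (void)data; }

        static void clear_poison(unsigned long lba, size_t len)
        { printf("clear-poison command for lba %lu (%zu bytes)\n", lba, len); }

        static void cache_invalidate(unsigned long lba, size_t len)
        { printf("invalidate cached copies of lba %lu (%zu bytes)\n", lba, len); }

        static bool block_is_known_bad(unsigned long lba) { return lba == 7; }

        /* Mirrors the ordering from the commit: write the data, then clear the
         * poison, then flush the stale poison out of the cache hierarchy. */
        static void pmem_write(unsigned long lba, const void *data, size_t len)
        {
            write_block(lba, data, len);
            if (block_is_known_bad(lba)) {
                clear_poison(lba, len);
                cache_invalidate(lba, len);
            }
        }

        int main(void)
        {
            char buf[512] = { 0 };

            pmem_write(3, buf, sizeof(buf));  /* healthy block   */
            pmem_write(7, buf, sizeof(buf));  /* known bad block */
            return 0;
        }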
  18. 09 March 2016, 2 commits
  19. 08 March 2016, 3 commits
    • x86/entry/32: Introduce and use X86_BUG_ESPFIX instead of paravirt_enabled · 58a5aac5
      Committed by Andy Lutomirski
      x86_64 has very clean espfix handling on paravirt: espfix64 is set
      up in native_iret, so paravirt systems that override iret bypass
      espfix64 automatically.  This is robust and straightforward.
      
      x86_32 is messier.  espfix is set up before the IRET paravirt patch
      point, so it can't be directly conditionalized on whether we use
      native_iret.  We also can't easily move it into native_iret without
      regressing performance due to a bizarre consideration.  Specifically,
      on 64-bit kernels, the logic is:
      
        if (regs->ss & 0x4)
                setup_espfix;
      
      On 32-bit kernels, the logic is:
      
        if ((regs->ss & 0x4) && (regs->cs & 0x3) == 3 &&
            (regs->flags & X86_EFLAGS_VM) == 0)
                setup_espfix;
      
      The performance of setup_espfix itself is essentially irrelevant, but
      the comparison happens on every IRET so its performance matters.  On
      x86_64, there's no need for any registers except flags to implement
      the comparison, so we fold the whole thing into native_iret.  On
      x86_32, we don't do that because we need a free register to
      implement the comparison efficiently.  We therefore do espfix setup
      before restoring registers on x86_32.
      
      This patch gets rid of the explicit paravirt_enabled check by
      introducing X86_BUG_ESPFIX on 32-bit systems and using an ALTERNATIVE
      to skip espfix on paravirt systems where iret != native_iret.  This is
      also messy, but it's at least in line with other things we do.
      
      This improves espfix performance by removing a branch, but no one
      cares.  More importantly, it removes a paravirt_enabled user, which is
      good because paravirt_enabled is ill-defined and is going away.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: boris.ostrovsky@oracle.com
      Cc: david.vrabel@citrix.com
      Cc: konrad.wilk@oracle.com
      Cc: lguest@lists.ozlabs.org
      Cc: xen-devel@lists.xensource.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      58a5aac5
    • KVM: MMU: simplify last_pte_bitmap · 6bb69c9b
      Committed by Paolo Bonzini
      Branch-free code is fun and everybody knows how much Avi loves it,
      but last_pte_bitmap takes it a bit to the extreme.  Since the code
      is simply doing a range check, like

      	(level == 1 ||
      	 ((gpte & PT_PAGE_SIZE_MASK) && level < N))

      we can make it branch-free without storing the entire truth table;
      it is enough to cache N.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6bb69c9b
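
      A user-space sketch of the idea, comparing the straightforward range check
      quoted above with a variant that only needs the cached level N. This is
      illustrative only; KVM's actual helper and its exact expression differ.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define PT_PAGE_SIZE_MASK (1ull << 7)   /* large-page bit in a guest PTE */

        /* Straightforward version of the range check from the commit message. */
        static bool last_pte_branchy(unsigned level, uint64_t gpte, unsigned n)
        {
            return level == 1 ||
                   ((gpte & PT_PAGE_SIZE_MASK) && level < n);
        }

        /* Same predicate using only masks/comparisons on the cached N, so no
         * per-role truth table (bitmap) is required. */
        static bool last_pte_cached_n(unsigned level, uint64_t gpte, unsigned n)
        {
            unsigned ps = !!(gpte & PT_PAGE_SIZE_MASK);

            return (level == 1) | (ps & (level < n));
        }

        int main(void)
        {
            for (unsigned level = 1; level <= 4; level++)
                for (uint64_t gpte = 0; gpte <= PT_PAGE_SIZE_MASK; gpte += PT_PAGE_SIZE_MASK)
                    if (last_pte_branchy(level, gpte, 3) !=
                        last_pte_cached_n(level, gpte, 3))
                        printf("mismatch at level %u\n", level);
            puts("both variants agree");
            return 0;
        }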
    • x86/apic: Deinline __default_send_IPI_*, save ~200 bytes · 1a8aa8ac
      Committed by Denys Vlasenko
      __default_send_IPI_shortcut: 49 bytes, 2 callsites
      __default_send_IPI_dest_field: 108 bytes, 7 callsites
      
           text     data      bss       dec     hex filename
       96184086 20860488 36122624 153167198 921255e vmlinux_before
       96183823 20860520 36122624 153166967 9212477 vmlinux
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Borislav Petkov <bp@alien.de>
      Cc: Daniel J Blueman <daniel@numascale.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Travis <travis@sgi.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1457287876-6001-1-git-send-email-dvlasenk@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1a8aa8ac