1. 30 Jul, 2017 2 commits
  2. 27 Jul, 2017 1 commit
    • A
      x86/ldt/64: Refresh DS and ES when modify_ldt changes an entry · a6323757
      Committed by Andy Lutomirski
      On x86_32, modify_ldt() implicitly refreshes the cached DS and ES
      segments because they are refreshed on return to usermode.
      
      On x86_64, they're not refreshed on return to usermode.  To improve
      determinism and match x86_32's behavior, refresh them when we update
      the LDT.
      
      This avoids a situation in which the DS points to a descriptor that is
      changed but the old cached segment persists until the next reschedule.
      If this happens, then the user-visible state will change
      nondeterministically some time after modify_ldt() returns, which is
      unfortunate.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Chang Seok <chang.seok.bae@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
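The stale-cache problem described in the modify_ldt commit above can be illustrated with a toy user-space model. This is only a sketch: the structures and function names (`struct seg_reg`, `load_segment`, `write_ldt_entry`) are hypothetical and do not correspond to kernel code. It assumes only that a segment register keeps a private copy of its descriptor, taken when the selector is loaded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model, not kernel code: all names here are hypothetical. */
struct descriptor {
    uint32_t base;
    uint32_t limit;
};

struct seg_reg {
    uint16_t index;           /* which LDT slot the selector names */
    struct descriptor cached; /* copy taken when the selector was loaded */
};

#define LDT_SLOTS 4
static struct descriptor ldt[LDT_SLOTS];

/* Loading a selector snapshots the descriptor into the register. */
static void load_segment(struct seg_reg *seg, uint16_t index)
{
    seg->index = index;
    seg->cached = ldt[index];
}

/*
 * Change an LDT slot. With refresh == false the register keeps its old
 * snapshot until something else reloads it; with refresh == true the
 * register is reloaded at once, so the change is visible immediately.
 */
static void write_ldt_entry(struct seg_reg *ds, uint16_t index,
                            struct descriptor d, bool refresh)
{
    ldt[index] = d;
    if (refresh && ds->index == index)
        load_segment(ds, index);
}
```

In this model, a write without refresh leaves `ds.cached` pointing at the old base until some later reload, which mirrors the "changes some time after modify_ldt() returns" behavior the commit removes.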
  3. 26 Jul, 2017 1 commit
    • J
      x86/unwind: Add the ORC unwinder · ee9f8fce
      Committed by Josh Poimboeuf
      Add the new ORC unwinder which is enabled by CONFIG_ORC_UNWINDER=y.
      It plugs into the existing x86 unwinder framework.
      
      It relies on objtool to generate the needed .orc_unwind and
      .orc_unwind_ip sections.
      
      For more details on why ORC is used instead of DWARF, see
      Documentation/x86/orc-unwinder.txt - but the short version is
      that it's a simplified, fundamentally more robust debuginfo
      data structure, which also allows up to two orders of magnitude
      faster lookups than the DWARF unwinder - which matters to
      profiling workloads like perf.
      
      Thanks to Andy Lutomirski for the performance improvement ideas:
      splitting the ORC unwind table into two parallel arrays and creating a
      fast lookup table to search a subset of the unwind table.
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: live-patching@vger.kernel.org
      Link: http://lkml.kernel.org/r/0a6cbfb40f8da99b7a45a1a8302dc6aef16ec812.1500938583.git.jpoimboe@redhat.com
      [ Extended the changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
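The two performance ideas credited in the ORC commit above (parallel arrays plus a fast lookup table that narrows the search) can be sketched roughly as follows. This is an illustration under stated assumptions, not the kernel's implementation: the entry layout, the 256-byte block size, and every name (`struct unwind_table`, `unwind_build_lookup`, `unwind_find`) are hypothetical, and the table is assumed to start exactly at `text_start`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified unwind record; the real ORC entry differs. */
struct unwind_entry {
    int16_t sp_offset;
};

#define BLOCK_SHIFT 8 /* assumed: one lookup slot per 256 bytes of text */

struct unwind_table {
    const uint64_t *ips;                /* sorted start addresses */
    const struct unwind_entry *entries; /* entries[i] covers code from ips[i] */
    size_t num;
    uint64_t text_start;                /* assumed equal to ips[0] */
    uint32_t *lookup;                   /* num_blocks + 1 slots */
    size_t num_blocks;
};

/* lookup[b] = index of the last entry starting at or before block b. */
static void unwind_build_lookup(struct unwind_table *t)
{
    size_t i = 0;

    for (size_t b = 0; b <= t->num_blocks; b++) {
        uint64_t block_start = t->text_start + ((uint64_t)b << BLOCK_SHIFT);

        while (i + 1 < t->num && t->ips[i + 1] <= block_start)
            i++;
        t->lookup[b] = (uint32_t)i;
    }
}

/* Return the entry covering addr, or NULL if addr is outside the table. */
static const struct unwind_entry *unwind_find(const struct unwind_table *t,
                                              uint64_t addr)
{
    if (t->num == 0 || addr < t->text_start)
        return NULL;

    size_t block = (size_t)((addr - t->text_start) >> BLOCK_SHIFT);
    if (block >= t->num_blocks)
        return NULL;

    /* The lookup table bounds the binary search to a small index range. */
    size_t lo = t->lookup[block];
    size_t hi = t->lookup[block + 1];

    while (lo < hi) {
        size_t mid = lo + (hi - lo + 1) / 2;

        if (t->ips[mid] <= addr)
            lo = mid;
        else
            hi = mid - 1;
    }
    return &t->entries[lo];
}
```

Keeping the addresses in their own array means the binary search touches only densely packed keys, and the per-block lookup table shrinks the searched range from the whole table to the few entries that overlap one block.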
  4. 18 Jul, 2017 3 commits
  5. 14 Jul, 2017 1 commit
  6. 13 Jul, 2017 2 commits
  7. 05 Jul, 2017 4 commits
    • C
      x86/boot/e820: Introduce the bootloader provided e820_table_firmware[] table · 12df216c
      Committed by Chen Yu
      Add the real e820_table_firmware[] that will not be modified by the kernel
      or the EFI boot stub under any circumstance.
      
      In addition to that, modify the code so that e820_table_firmware[] is
      exposed via sysfs to represent the real firmware memory layout,
      rather than exposing the e820_table_kexec[] table.
      
      This fixes a hibernation bug/warning, where e820_table_kexec[] is used to check
      RAM layout consistency across hibernation/resume:
      
        The suspend kernel:
        [    0.000000] e820: update [mem 0x76671018-0x76679457] usable ==> usable
      
        The resume kernel:
        [    0.000000] e820: update [mem 0x7666f018-0x76677457] usable ==> usable
        ...
        [   15.752088] PM: Using 3 thread(s) for decompression.
        [   15.752088] PM: Loading and decompressing image data (471870 pages)...
        [   15.764971] Hibernate inconsistent memory map detected!
        [   15.770833] PM: Image mismatch: architecture specific data
      
      Actually it is safe to restore these pages because E820_TYPE_RAM and
      E820_TYPE_RESERVED_KERN are treated the same during hibernation, so
      the original e820 table provided by the bootloader is used for
      hibernation MD5 fingerprint checking.
      
      The side effect is that this newly introduced variable might increase the
      kernel size at compile time.
      Suggested-by: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Chen Yu <yu.c.chen@intel.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xunlei Pang <xlpang@redhat.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
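The relationship between the three tables in the commit above can be sketched as three copies of the bootloader-provided map, where only the working and kexec copies are ever changed afterwards. This is a simplified user-space illustration: the types, the type constants, and the helper names (`tables_init`, `table_retype`, `tables_equal`) are hypothetical and are not the kernel's e820 API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative type codes only; names and values are assumptions. */
enum { REGION_RAM = 1, REGION_RESERVED_KERN = 128 };

struct region {
    uint64_t addr;
    uint64_t size;
    int type;
};

#define MAX_REGIONS 8
struct region_table {
    size_t nr;
    struct region entries[MAX_REGIONS];
};

/* Working map, kexec map, and the untouched firmware map. */
static struct region_table table_working, table_kexec, table_firmware;

/* All three start out as copies of what the bootloader passed in. */
static void tables_init(const struct region_table *boot)
{
    table_working = *boot;
    table_kexec = *boot;
    table_firmware = *boot;
}

/* Later kernel-side adjustments; never called on table_firmware. */
static void table_retype(struct region_table *t, uint64_t addr, int new_type)
{
    for (size_t i = 0; i < t->nr; i++)
        if (t->entries[i].addr == addr)
            t->entries[i].type = new_type;
}

/* Field-by-field comparison, as a consistency check might do. */
static bool tables_equal(const struct region_table *a,
                         const struct region_table *b)
{
    if (a->nr != b->nr)
        return false;
    for (size_t i = 0; i < a->nr; i++)
        if (a->entries[i].addr != b->entries[i].addr ||
            a->entries[i].size != b->entries[i].size ||
            a->entries[i].type != b->entries[i].type)
            return false;
    return true;
}
```

A consistency check that compares the firmware copy sees the same layout on every boot of the same machine, even if the working and kexec copies were adjusted differently by the suspend and resume kernels.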
    • C
      x86/boot/e820: Rename the e820_table_firmware to e820_table_kexec · a09bae0f
      Committed by Chen Yu
      Currently the e820_table_firmware[] table is mainly used by kexec,
      and it is not what it's supposed to be - despite its name it might be
      modified by the kernel.
      
      So change its name to e820_table_kexec[]. In the next patch we will
      introduce the real e820_table_firmware[] table.
      
      No functional change.
      Signed-off-by: Chen Yu <yu.c.chen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xunlei Pang <xlpang@redhat.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • C
      x86/boot/e820: Avoid overwriting e820_table_firmware · b7a67e02
      Committed by Chen Yu
      The following commit in 2013:
      
        77ea8c94 ("x86: Reserve setup_data ranges late after parsing memmap cmdline")
      
      has fixed the issue of losing setup_data information by deferring the
      e820_reserve_setup_data() call until the early params have been parsed.
      
      But this also introduced a new problem: during early params parsing,
      the kexec kernel might fake an mptable and save it into the e820_table_firmware[]
      table (without saving the mptable to the e820_table[]). However, the subsequent
      invocation of e820_reserve_setup_data() will overwrite the e820_table_firmware[]
      according to the e820_table[], so the fake mptable information is lost.
      
      Fix this issue by updating the e820_table_firmware[] according to
      the setup_data information, but without overwriting it.
      Signed-off-by: Chen Yu <yu.c.chen@intel.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xunlei Pang <xlpang@redhat.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • M
      x86/mm/pat: Don't report PAT on CPUs that don't support it · 99c13b8c
      Committed by Mikulas Patocka
      The pat_enabled() logic is broken on CPUs which do not support PAT and
      where the initialization code fails to call pat_init(). As a result, the
      enabled flag stays true and pat_enabled() wrongly returns true.
      
      As a consequence the mappings, e.g. for Xorg, are set up with the wrong
      caching mode and the required MTRR setups are omitted.
      
      To cure this the following changes are required:
      
        1) Make pat_enabled() return true only if PAT initialization was
           invoked and successful.
      
        2) Invoke init_cache_modes() unconditionally in setup_arch() and
           remove the extra callsites in pat_disable() and the pat disabled
           code path in pat_init().
      
      Also rename __pat_enabled to pat_disabled to reflect the real purpose of
      this variable.
      
      Fixes: 9cd25aac ("x86/mm/pat: Emulate PAT when it is disabled")
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Bernhard Held <berny156@gmx.de>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: "Luis R. Rodriguez" <mcgrof@suse.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1707041749300.3456@file01.intranet.prod.int.rdu2.redhat.com
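The flag logic described in the PAT commit above (report PAT as enabled only once initialization actually ran and succeeded) might look roughly like this user-space sketch. The `sketch_` function names and the `cpu_has_pat` parameter are hypothetical stand-ins, and the code deliberately omits the real MSR programming.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's state; not the real code. */
static bool pat_disabled;    /* set when PAT is turned off on purpose */
static bool pat_initialized; /* set only after PAT setup really ran */

static void sketch_pat_disable(void)
{
    pat_disabled = true;
}

/* cpu_has_pat stands in for the CPU feature check. */
static void sketch_pat_init(bool cpu_has_pat)
{
    if (pat_disabled || !cpu_has_pat)
        return; /* nothing was programmed, so stay "not enabled" */

    /* The real code would program the PAT MSR at this point. */
    pat_initialized = true;
}

/* True only if initialization was invoked and successful. */
static bool sketch_pat_enabled(void)
{
    return pat_initialized;
}
```

The point of the design is that the default answer is "not enabled": a code path that never reaches the init function can no longer leave a stale "enabled" flag behind.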
  8. 01 Jul, 2017 1 commit
  9. 30 Jun, 2017 2 commits
  10. 29 Jun, 2017 1 commit
  11. 28 Jun, 2017 3 commits
  12. 27 Jun, 2017 2 commits
    • Y
      x86/ACPI/cstate: Allow ACPI C1 FFH MWAIT use on AMD systems · 5209654a
      Committed by Yazen Ghannam
      AMD systems support the Monitor/Mwait instructions and these can be used
      for ACPI C1 in the same way as on Intel systems.
      
      Three things are needed:
       1) This patch.
       2) BIOS that declares a C1 state in _CST to use FFH, with correct values.
       3) CPUID_Fn00000005_EDX is non-zero on the system.
      
      BIOSes on AMD systems have historically not defined a C1 state in _CST,
      so the acpi_idle driver uses HALT for ACPI C1.
      
      Currently released systems have CPUID_Fn00000005_EDX as reserved/RAZ. If a
      BIOS is released for these systems that requests a C1 state with FFH, the
      FFH implementation in Linux will fail since CPUID_Fn00000005_EDX is 0. The
      acpi_idle driver will then fall back to using HALT for ACPI C1.
      
      Future systems are expected to have non-zero CPUID_Fn00000005_EDX and BIOS
      support for using FFH for ACPI C1.
      
      Allow ffh_cstate_init() to succeed on AMD systems.
      
      Tested on Fam15h and Fam17h systems.
      Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
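The CPUID condition in the commit above can be made concrete with a small helper. This sketch assumes the commonly documented layout of CPUID leaf 5 EDX, in which each 4-bit field holds the number of MWAIT sub-states for one C-state (bits 3:0 for C0, bits 7:4 for C1, and so on). The function names are hypothetical, and checking only the C1 field is a narrower test than the commit's "EDX is non-zero" wording.

```c
#include <stdbool.h>
#include <stdint.h>

#define SUBSTATE_BITS 4    /* assumed width of each per-C-state field */
#define SUBSTATE_MASK 0xfu

/* Number of MWAIT sub-states EDX advertises for the given C-state. */
static unsigned int mwait_substates(uint32_t cpuid5_edx, unsigned int c_state)
{
    if (c_state > 7) /* only eight 4-bit fields fit in 32 bits */
        return 0;
    return (cpuid5_edx >> (c_state * SUBSTATE_BITS)) & SUBSTATE_MASK;
}

/* FFH-based C1 needs at least one advertised C1 sub-state. */
static bool ffh_c1_usable(uint32_t cpuid5_edx)
{
    return mwait_substates(cpuid5_edx, 1) != 0;
}
```

With EDX reading as zero, as on the currently released systems the commit mentions, `ffh_c1_usable()` is false and a caller would fall back to HALT.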
    • L
      x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF · f8475cef
      Committed by Len Brown
      The goal of this change is to give users a uniform and meaningful
      result when they read /sys/...cpufreq/scaling_cur_freq
      on modern x86 hardware, as compared to what they get today.
      
      Modern x86 processors include the hardware needed
      to accurately calculate frequency over an interval --
      APERF, MPERF, and the TSC.
      
      Here we provide an x86 routine to make this calculation
      on supported hardware, and use it in preference to any
      driver-specific cpufreq_driver.get() routine.
      
      MHz is computed like so:
      
      MHz = base_MHz * delta_APERF / delta_MPERF
      
      MHz is the average frequency of the busy processor
      over a measurement interval.  The interval is
      defined to be the time between successive invocations
      of aperfmperf_khz_on_cpu(), which are expected to
      happen on demand when users read the sysfs attribute
      cpufreq/scaling_cur_freq.
      
      As with previous methods of calculating MHz,
      idle time is excluded.
      
      base_MHz above is from TSC calibration global "cpu_khz".
      
      This x86 native method to calculate MHz returns a meaningful result
      regardless of whether P-states are controlled by hardware or firmware,
      and whether or not the Linux cpufreq sub-system is installed.
      
      When this routine is invoked more frequently, the measurement
      interval becomes shorter.  However, the code limits re-computation
      to 10ms intervals so that average frequency remains meaningful.
      
      Discerning users are encouraged to take advantage of
      the turbostat(8) utility, which can gracefully handle
      concurrent measurement intervals of arbitrary length.
      Signed-off-by: Len Brown <len.brown@intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
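The calculation described in the commit above, including the 10 ms limit on re-computation, can be sketched in user-space C as follows. The struct, the function name `avg_khz`, and the millisecond timestamp parameter are hypothetical; the real routine reads the counters from MSRs, which this sketch replaces with plain arguments.

```c
#include <stdint.h>

#define MIN_INTERVAL_MS 10 /* re-computation limit named in the commit */

/* Hypothetical per-CPU bookkeeping for this sketch. */
struct freq_sample {
    uint64_t aperf;   /* APERF value at the last computation */
    uint64_t mperf;   /* MPERF value at the last computation */
    uint64_t time_ms; /* when the last computation happened */
    unsigned int khz; /* last computed average frequency */
};

/*
 * Average frequency since the previous computation:
 *     khz = base_khz * delta_APERF / delta_MPERF
 * Calls closer together than MIN_INTERVAL_MS return the cached value.
 */
static unsigned int avg_khz(struct freq_sample *s, unsigned int base_khz,
                            uint64_t aperf, uint64_t mperf, uint64_t now_ms)
{
    if (now_ms - s->time_ms < MIN_INTERVAL_MS)
        return s->khz;

    /* Unsigned subtraction stays correct across counter wraparound. */
    uint64_t delta_aperf = aperf - s->aperf;
    uint64_t delta_mperf = mperf - s->mperf;

    if (delta_mperf != 0) /* avoid dividing by zero */
        s->khz = (unsigned int)((uint64_t)base_khz * delta_aperf / delta_mperf);

    s->aperf = aperf;
    s->mperf = mperf;
    s->time_ms = now_ms;
    return s->khz;
}
```

For example, with a base of 2,000,000 kHz, a delta_APERF of 1500 and a delta_MPERF of 1000, the result is 3,000,000 kHz, that is, the CPU averaged 1.5 times its base frequency while busy. The multiplication can overflow 64 bits for very long intervals, which this sketch does not guard against.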
  13. 26 Jun, 2017 2 commits
  14. 24 Jun, 2017 1 commit
    • L
      x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz" · 51204e06
      Committed by Len Brown
      cpufreq_quick_get() allows cpufreq drivers to over-ride cpu_khz
      that is otherwise reported in x86 /proc/cpuinfo "cpu MHz".
      
      There are four problems with this scheme,
      any one of which is sufficient justification to delete it.
      
       1. Depending on which cpufreq driver is loaded, the behavior
          of this field is different.
      
       2. Distros complain that they have to explain to users
          why and how this field changes.  Distros have requested a constant.
      
       3. The two major providers of this information, acpi_cpufreq
          and intel_pstate, both "get it wrong" in different ways.
      
          acpi_cpufreq lies to the user by telling them that
          they are running at whatever frequency was last
          requested by software.
      
          intel_pstate lies to the user by telling them that
          they are running at the average frequency computed
          over an undefined measurement.  But an average computed
          over an undefined interval is itself undefined...
      
       4. On modern processors, user space utilities, such as
          turbostat(8), are more accurate and more precise, while
          supporting concurrent measurement over arbitrary intervals.
      
      Users who have been consulting /proc/cpuinfo to
      track changing CPU frequency will be disappointed that
      it no longer wiggles -- perhaps being unaware of the
      limitations of the information they have been consuming.
      
      Yes, they can change their scripts to look in sysfs
      cpufreq/scaling_cur_freq.  Here they will find the same
      data of dubious quality here removed from /proc/cpuinfo.
      The value in sysfs will be addressed in a subsequent patch
      to address issues 1-3, above.
      
      Issue 4 will remain -- users that really care about
      accurate frequency information should not be using either
      proc or sysfs kernel interfaces.
      They should be using turbostat(8), or a similar
      purpose-built analysis tool.
      Signed-off-by: Len Brown <len.brown@intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  15. 23 Jun, 2017 14 commits