1. 10 July 2016, 2 commits
    • x86/fpu/xstate: Fix PTRACE frames for XSAVES · 91c3dba7
      Committed by Yu-cheng Yu
      XSAVES uses a compacted format and is a kernel-only instruction. The
      kernel should use standard-format, non-supervisor state data for PTRACE.
      Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
      [ Edited away artificial linebreaks. ]
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
      Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/de3d80949001305fe389799973b675cab055c457.1466179491.git.yu-cheng.yu@intel.com
      [ Made various readability edits. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      91c3dba7
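      For context, a minimal sketch of what such a compacted-to-standard
      conversion looks like; the helper and array names below
      (xstate_offsets[], xstate_sizes[], get_xsave_addr()) follow the kernel's
      FPU code in spirit but are simplified assumptions, not the literal patch:

          /*
           * Copy each enabled user xstate component out of a compacted-format
           * XSAVES buffer to its fixed standard-format offset for PTRACE.
           */
          static void copy_compacted_to_standard(struct xregs_state *xsave, void *dst)
          {
                  u64 mask = xsave->header.xfeatures & ~XFEATURE_MASK_SUPERVISOR;
                  int i;

                  for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
                          if (!(mask & BIT_ULL(i)))
                                  continue;
                          /* xstate_offsets[i]: standard-format offset from CPUID 0xD */
                          memcpy(dst + xstate_offsets[i],
                                 get_xsave_addr(xsave, i),   /* compacted location */
                                 xstate_sizes[i]);
                  }
          }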
    • x86/fpu/xstate: Fix supervisor xstate component offset · 1499ce2d
      Committed by Yu-cheng Yu
      CPUID function 0x0d, sub-function i (i > 1) returns in EBX the offset of
      xstate component i. Zero is returned for a supervisor state. A supervisor
      state can only be saved by XSAVES, and XSAVES uses a compacted format,
      so there is no fixed offset for a supervisor state. This patch checks and
      makes sure a supervisor state offset is not recorded or misused. This has
      no effect in practice as we currently use no supervisor states, but it
      is worth fixing.
      Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
      Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/81b29e40d35d4cec9f2511a856fe769f34935a3f.1466179491.git.yu-cheng.yu@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1499ce2d
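      A small user-space sketch of the enumeration rule described above: EBX
      only carries a valid offset for user states; ECX bit 0 flags a
      supervisor state, whose EBX must be ignored. This assumes GCC's
      <cpuid.h> helpers:

          #include <cpuid.h>
          #include <stdio.h>

          int main(void)
          {
                  unsigned int eax, ebx, ecx, edx, i;

                  for (i = 2; i < 64; i++) {
                          if (!__get_cpuid_count(0x0d, i, &eax, &ebx, &ecx, &edx))
                                  break;
                          if (!eax)       /* size 0: component not implemented */
                                  continue;
                          if (ecx & 1)    /* supervisor state: no fixed offset */
                                  printf("component %u: supervisor, size %u\n", i, eax);
                          else
                                  printf("component %u: offset %u, size %u\n", i, ebx, eax);
                  }
                  return 0;
          }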
  2. 27 June 2016, 2 commits
  3. 25 June 2016, 2 commits
    • tree wide: get rid of __GFP_REPEAT for order-0 allocations part I · 32d6bd90
      Committed by Michal Hocko
      This is the third version of the patchset previously sent [1].  I have
      basically only rebased it on top of 4.7-rc1 tree and dropped "dm: get
      rid of superfluous gfp flags" which went through dm tree.  I am sending
      it now because it is tree wide and chances for conflicts are reduced
      considerably when we want to target rc2.  I plan to send the next step
      and rename the flag and move to a better semantic later during this
      release cycle so we will have a new semantic ready for 4.8 merge window
      hopefully.
      
      Motivation:
      
      While working on something unrelated I've checked the current usage of
      __GFP_REPEAT in the tree.  It seems that a majority of the usage is and
      always has been bogus because __GFP_REPEAT has always been about costly
      high order allocations while we are using it for order-0 or very small
      orders very often.  A big pile of them looks like simple copy&paste from
      when code was adapted from one arch to another.
      
      I think it makes some sense to get rid of them because they are just
      making the semantics more unclear.  Please note that __GFP_REPEAT is
      documented as:
      
       * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
       * _might_ fail.  This depends upon the particular VM implementation.
      
      while !costly requests have basically nofail semantics, so one could
      reasonably expect that an order-0 request with __GFP_REPEAT will not loop
      forever.  This is not implemented right now though.
      
      I would like to move on with __GFP_REPEAT and define a better semantic
      for it.
      
        $ git grep __GFP_REPEAT origin/master | wc -l
        111
        $ git grep __GFP_REPEAT | wc -l
        36
      
      So we are down to about a third after this patch series.  The remaining
      places really seem to be relying on __GFP_REPEAT due to large allocation
      requests.  This still needs some double checking which I will do later
      after all the simple ones are sorted out.
      
      I am touching a lot of arch-specific code here and I hope I got it right,
      but as a matter of fact I didn't even compile-test some archs, as I do
      not have cross-compilers for them.  Patches should be quite trivial to
      review for stupid compile mistakes though.  The tricky parts are usually
      hidden by macro definitions, and that's where I would appreciate help
      from arch maintainers.
      
      [1] http://lkml.kernel.org/r/1461849846-27209-1-git-send-email-mhocko@kernel.org
      
      This patch (of 19):
      
      __GFP_REPEAT has a rather weak semantic but since it has been introduced
      around 2.6.12 it has been ignored for low order allocations.  Yet we
      have the full kernel tree with its usage for apparently order-0
      allocations.  This is really confusing because __GFP_REPEAT is
      explicitly documented to allow allocation failures which is a weaker
      semantic than the current order-0 has (basically nofail).
      
      Let's simply drop __GFP_REPEAT from those places.  This will allow us to
      identify the places which really need the allocator to retry harder, and
      to formulate a more specific semantic for what the flag is actually
      supposed to do.
      
      Link: http://lkml.kernel.org/r/1464599699-30131-2-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Crispin <blogic@openwrt.org>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      32d6bd90
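      The mechanical shape of the change is the same at every call site; an
      illustrative (not verbatim) order-0 hunk:

          pte_t *pte;

          /* before: __GFP_REPEAT has been ignored for order-0 since ~2.6.12 */
          pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);

          /* after: same nofail-ish behaviour, clearer semantics */
          pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);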
    • x86: fix up a few misc stack pointer vs thread_info confusions · aca9c293
      Committed by Linus Torvalds
      As the actual pointer value is the same for the thread stack allocation
      and the thread_info, code that confused the two worked fine, but will
      break when the thread info is moved away from the stack allocation.  It
      also looks very confusing.
      
      For example, the kprobe code wanted to know the current top of stack.
      To do that, it used this:
      
      	(unsigned long)current_thread_info() + THREAD_SIZE
      
      which did indeed give the correct value.  But it's not only a fairly
      nonsensical expression, it's also rather complex, especially since we
      actually have this:
      
      	static inline unsigned long current_top_of_stack(void)
      
      which not only gives us the value we are interested in, but also happens
      to be how "current_thread_info()" is currently defined:
      
      	(struct thread_info *)(current_top_of_stack() - THREAD_SIZE);
      
      so using current_thread_info() to figure out the top of the stack really
      is a very round-about thing to do.
      
      The other cases are just simpler confusion about task_thread_info() vs
      task_stack_page(), which currently return the same pointer - but if you
      want the stack page, you really should be using the latter one.
      
      And there was one entirely unused assignment of the current stack to a
      thread_info pointer.
      
      All cleaned up to make more sense today, and make it easier to move the
      thread_info away from the stack in the future.
      
      No semantic changes.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aca9c293
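      The cleanups follow one pattern; illustrative before/after lines (not
      verbatim hunks from the patch):

          /* kprobes: asking for the top of the current stack */
          /* before: stack_addr = (unsigned long)current_thread_info() + THREAD_SIZE; */
          stack_addr = current_top_of_stack();

          /* prefer task_stack_page() when the stack page is what you want */
          /* before: void *stack = task_thread_info(task);  (works only by accident) */
          void *stack = task_stack_page(task);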
  4. 24 June 2016, 1 commit
  5. 18 June 2016, 3 commits
  6. 16 June 2016, 1 commit
  7. 08 June 2016, 2 commits
    • x86/fpu: Add tracepoints to dump FPU state at key points · d1898b73
      Committed by Dave Hansen
      I've been carrying this patch around for a bit and it's helped me
      solve at least a couple of FPU-related bugs.  In addition to using
      it for debugging, I also dragged it out because using AVX (and
      AVX2/AVX-512) can have serious power consequences for a modern
      core.  It's very important to be able to figure out who is using
      it.
      
      It's also insanely useful to go out and see who is using a given
      feature, like MPX or Memory Protection Keys.  If you, for
      instance, want to find all processes using protection keys, you
      can do:
      
      	echo 'xfeatures & 0x200' > filter
      
      since 0x200 is the protection keys feature bit.
      
      Note that this touches the KVM code.  KVM did a CREATE_TRACE_POINTS
      and then included a bunch of random headers.  If any one of
      those included other tracepoints, it would have defined the *OTHER*
      tracepoints.  That's bogus, so move it to the right place.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160601174220.3CDFB90E@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d1898b73
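      A hedged sketch of the shape of such a tracepoint; the event and field
      names here are illustrative, not the exact ones added by the patch:

          TRACE_EVENT(x86_fpu_state,              /* illustrative event name */
                  TP_PROTO(struct fpu *fpu),
                  TP_ARGS(fpu),
                  TP_STRUCT__entry(
                          __field(struct fpu *, fpu)
                          __field(u64, xfeatures)
                  ),
                  TP_fast_assign(
                          __entry->fpu       = fpu;
                          __entry->xfeatures = fpu->state.xsave.header.xfeatures;
                  ),
                  TP_printk("fpu: %p xfeatures: %llx",
                            __entry->fpu, __entry->xfeatures)
          );

      The 'xfeatures & 0x200' filter above then matches on the recorded
      xfeatures field.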
    • x86/cpu/intel: Introduce macros for Intel family numbers · 970442c5
      Committed by Dave Hansen
      Problem:
      
      We have a boatload of open-coded family-6 model numbers.  Half of
      them have these model numbers in hex and the other half in
      decimal.  This makes grepping for them tons of fun, if you were
      to try.
      
      Solution:
      
      Consolidate all the magic numbers.  Put all the definitions in
      one header.
      
      The names here are closely derived from the comments describing
      the models from arch/x86/events/intel/core.c.  We could easily
      make them shorter by doing things like s/SANDYBRIDGE/SNB/, but
      they seemed fine even with the longer versions to me.
      
      Do not take any of these names too literally, like "DESKTOP"
      or "MOBILE".  These are all colloquial names and not precise
      descriptions of everywhere a given model will show up.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Darren Hart <dvhart@infradead.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Doug Thompson <dougthompson@xmission.com>
      Cc: Eduardo Valentin <edubezval@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Rajneesh Bhardwaj <rajneesh.bhardwaj@intel.com>
      Cc: Souvik Kumar Chakravarty <souvik.k.chakravarty@intel.com>
      Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Ulf Hansson <ulf.hansson@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Vishwanath Somayaji <vishwanath.somayaji@intel.com>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: jacob.jun.pan@intel.com
      Cc: linux-acpi@vger.kernel.org
      Cc: linux-edac@vger.kernel.org
      Cc: linux-mmc@vger.kernel.org
      Cc: linux-pm@vger.kernel.org
      Cc: platform-driver-x86@vger.kernel.org
      Link: http://lkml.kernel.org/r/20160603001927.F2A7D828@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      970442c5
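      A sketch of the resulting header and a typical call site; the macro
      names follow the pattern described above, and the model values shown
      come from public CPUID documentation:

          /* arch/x86/include/asm/intel-family.h (illustrative excerpt) */
          #define INTEL_FAM6_SANDYBRIDGE          0x2A
          #define INTEL_FAM6_SKYLAKE_MOBILE       0x4E
          #define INTEL_FAM6_SKYLAKE_DESKTOP      0x5E

          /* call sites stop open-coding 0x5e / 94: */
          if (c->x86 == 6 && c->x86_model == INTEL_FAM6_SKYLAKE_DESKTOP)
                  pr_info("Skylake desktop detected\n");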
  8. 06 June 2016, 1 commit
  9. 28 May 2016, 1 commit
    • platform/x86: Add PMC Driver for Intel Core SoC · b740d2e9
      Committed by Rajneesh Bhardwaj
      This patch adds the Power Management Controller driver as a PCI driver
      for Intel Core SoC architecture.
      
      This driver can utilize debugging capabilities and supported features
      as exposed by the Power Management Controller.
      
      Please refer to the below specification for more details on PMC features.
      http://www.intel.in/content/www/in/en/chipsets/100-series-chipset-datasheet-vol-2.html
      
      The current version of this driver exposes the SLP_S0_RESIDENCY counter.
      This counter can be used for detecting fragile SLP_S0-signal-related
      failures and for taking corrective action when the PCH SLP_S0 signal is
      not asserted after kernel freeze as part of the suspend-to-idle flow
      (echo freeze > /sys/power/state).
      
      Intel Platform Controller Hub (PCH) asserts SLP_S0 signal when it
      detects favorable conditions to enter its low power mode. As a
      pre-requisite the SoC should be in deepest possible Package C-State
      and devices should be in low power mode. For example, on Skylake SoC
      the deepest Package C-State is Package C10 or PC10. Suspend to idle
      flow generally leads to PC10 state but PC10 state may not be sufficient
      for realizing the platform wide power potential which SLP_S0 signal
      assertion can provide.
      
      SLP_S0 signal is often connected to the Embedded Controller (EC) and the
      Power Management IC (PMIC) for other platform power management related
      optimizations.
      
      In general, SLP_S0 assertion == PC10 + PCH low power mode + ModPhy Lanes
      power gated + PLL Idle.
      
      This driver exposes a mechanism to read SLP_S0_RESIDENCY as an API, and
      adds debugfs entries that report the SLP_S0 signal assertion residency
      in microseconds:
      
      echo freeze > /sys/power/state
      wake the system
      cat /sys/kernel/debug/pmc_core/slp_s0_residency_usec
      Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@intel.com>
      Signed-off-by: Vishwanath Somayaji <vishwanath.somayaji@intel.com>
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Darren Hart <dvhart@linux.intel.com>
      b740d2e9
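      A hedged sketch of a kernel-side consumer of the exposed API; the
      function name and the ~100us counter step match the driver description
      above but should be treated as assumptions:

          #include <asm/pmc_core.h>

          static u64 slp_s0_residency_usec(void)
          {
                  u32 counter;

                  /* returns non-zero if the PMC device is not available */
                  if (intel_pmc_slp_s0_counter_read(&counter))
                          return 0;

                  /* raw counter ticks in ~100 microsecond steps (assumption) */
                  return (u64)counter * 100;
          }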
  10. 23 May 2016, 2 commits
    • x86: remove more uaccess_32.h complexity · bd28b145
      Committed by Linus Torvalds
      I'm looking at trying to possibly merge the 32-bit and 64-bit versions
      of the x86 uaccess.h implementation, but first this needs to be cleaned
      up.
      
      For example, the 32-bit version of "__copy_from_user_inatomic()" is
      mostly the special cases for the constant size, and it's actually almost
      never relevant.  Most users aren't actually using a constant size
      anyway, and the few cases that do small constant copies are better off
      just using __get_user() instead.
      
      So get rid of the unnecessary complexity.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd28b145
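      An illustrative (not verbatim) small-constant-size call site, showing
      why __get_user() is the better tool:

          u32 val;

          /* before: constant-size special case of __copy_from_user_inatomic() */
          if (__copy_from_user_inatomic(&val, uaddr, sizeof(val)))
                  return -EFAULT;

          /* after: small constant copies read better as __get_user() */
          if (__get_user(val, (u32 __user *)uaddr))
                  return -EFAULT;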
    • x86: remove pointless uaccess_32.h complexity · 5b09c3ed
      Committed by Linus Torvalds
      I'm looking at trying to possibly merge the 32-bit and 64-bit versions
      of the x86 uaccess.h implementation, but first this needs to be cleaned
      up.
      
      For example, the 32-bit version of "__copy_to_user_inatomic()" is mostly
      the special cases for the constant size, and it's actually never
      relevant.  All users except one aren't actually using a constant size
      anyway, and the single user that does is better off just using
      __put_user() instead.
      
      So get rid of the unnecessary complexity.
      
      [ The same cleanup should likely happen to __copy_from_user_inatomic()
        as well, but that one has a lot more users that I need to take a look
        at first ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b09c3ed
  11. 21 May 2016, 1 commit
  12. 20 May 2016, 2 commits
    • x86/mm/mpx: Work around MPX erratum SKD046 · 0f6ff2bc
      Committed by Dave Hansen
      This erratum essentially causes the CPU to forget which privilege
      level it is operating on (kernel vs. user) for the purposes of MPX.
      
      This erratum can only be triggered when a system is not using
      Supervisor Mode Execution Prevention (SMEP).  Our workaround for
      the erratum is to ensure that MPX can only be used in cases where
      SMEP is present in the processor and is enabled.
      
      This erratum only affects Core processors.  Atom is unaffected.
      But, there is no architectural way to determine Atom vs. Core.
      So, we just apply this workaround to all processors.  It's
      possible that it will mistakenly disable MPX on some Atom
      processors or future unaffected Core processors.  There are
      currently no processors that have MPX and not SMEP.  It would
      take something akin to a hypervisor masking SMEP out on an Atom
      processor for this to present itself on current hardware.
      
      More details can be found at:
      
        http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf
      
      "
        SKD046 Branch Instructions May Initialize MPX Bound Registers Incorrectly
      
        Problem:
      
        Depending on the current Intel MPX (Memory Protection
        Extensions) configuration, execution of certain branch
        instructions (near CALL, near RET, near JMP, and Jcc
        instructions) without a BND prefix (F2H) initialize the MPX bound
        registers. Due to this erratum, such a branch instruction that is
        executed both with CPL = 3 and with CPL < 3 may not use the
        correct MPX configuration register (BNDCFGU or BNDCFGS,
        respectively) for determining whether to initialize the bound
        registers; it may thus initialize the bound registers when it
        should not, or fail to initialize them when it should.
      
        Implication:
      
        A branch instruction that has executed both in user mode and in
        supervisor mode (from the same linear address) may cause a #BR
        (bound range fault) when it should not have, or may not cause a
        #BR when it should have.
      
        Workaround:
      
        An operating system can avoid this erratum by setting CR4.SMEP[bit 20]
        to enable supervisor-mode execution prevention (SMEP). When SMEP is
        enabled, no code can be executed both with CPL = 3 and with CPL < 3.
      "
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160512220400.3B35F1BC@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0f6ff2bc
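      The workaround reduces to one rule: never expose MPX unless SMEP is
      present and enabled. A hedged sketch of that check (names approximate
      the patch, not a verbatim hunk):

          static void check_mpx_erratum(struct cpuinfo_x86 *c)
          {
                  /*
                   * SKD046 workaround: if SMEP is unavailable (or masked off,
                   * e.g. by a hypervisor), MPX cannot be used safely.
                   */
                  if (cpu_has(c, X86_FEATURE_MPX) && !cpu_has(c, X86_FEATURE_SMEP))
                          setup_clear_cpu_cap(X86_FEATURE_MPX);
          }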
    • arch: fix has_transparent_hugepage() · fd8cfd30
      Committed by Hugh Dickins
      I've just discovered that the useful-sounding has_transparent_hugepage()
      is actually an architecture-dependent minefield: on some arches it only
      builds if CONFIG_TRANSPARENT_HUGEPAGE=y, on others it is also defined
      when that option is off, but on some of those (arm and arm64) it then
      gives the wrong answer; and on mips alone it's marked __init, which
      would crash if called later (though so far it never has been).
      
      Straighten this out: make it available to all configs, with a sensible
      default in asm-generic/pgtable.h, removing its definitions from those
      arches (arc, arm, arm64, sparc, tile) which are served by the default,
      adding #define has_transparent_hugepage has_transparent_hugepage to
      those (mips, powerpc, s390, x86) which need to override the default at
      runtime, and removing the __init from mips (but maybe that kind of code
      should be avoided after init: set a static variable the first time it's
      called).
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Vineet Gupta <vgupta@synopsys.com>		[arch/arc]
      Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	[arch/s390]
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fd8cfd30
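      The asm-generic default described above takes this shape; arches that
      must decide at runtime override it and add the matching
      '#define has_transparent_hugepage has_transparent_hugepage':

          /* include/asm-generic/pgtable.h (shape of the default) */
          #ifndef has_transparent_hugepage
          #ifdef CONFIG_TRANSPARENT_HUGEPAGE
          #define has_transparent_hugepage() 1
          #else
          #define has_transparent_hugepage() 0
          #endif
          #endif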
  13. 19 May 2016, 7 commits
  14. 16 May 2016, 1 commit
    • x86/cpufeature, x86/mm/pkeys: Fix broken compile-time disabling of pkeys · e8df1a95
      Committed by Dave Hansen
      When I added support for the Memory Protection Keys processor
      feature, I had to reindent the REQUIRED/DISABLED_MASK macros, and
      also consult the later cpufeature words.
      
      I'm not quite sure how I bungled it, but I consulted the wrong
      word at the end.  This only affected required or disabled cpu
      features in cpufeature words 14, 15 and 16.  So, only Protection
      Keys itself was screwed over here.
      
      The result was that if you disabled pkeys in your .config, you
      might still see some code show up that should have been compiled
      out.  There should be no functional problems, though.
      
      In verifying this patch I also realized that the DISABLE_PKU/OSPKE
      macros were defined backwards and that the cpu_has() check in
      setup_pku() was not doing the compile-time disabled checks.
      
      So also fix the macro for DISABLE_PKU/OSPKE and add a compile-time
      check for pkeys being enabled in setup_pku().
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: dfb4a70f ("x86/cpufeature, x86/mm/pkeys: Add protection keys related CPUID definitions")
      Link: http://lkml.kernel.org/r/20160513221328.C200930B@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e8df1a95
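      A hedged sketch of the setup_pku() check described above;
      cpu_feature_enabled() folds in the DISABLED_MASK bits, so configs with
      pkeys compiled out take the early return at build time:

          static void setup_pku(struct cpuinfo_x86 *c)
          {
                  if (!cpu_feature_enabled(X86_FEATURE_PKU))
                          return;         /* compile-time disabled */
                  if (!cpu_has(c, X86_FEATURE_PKU))
                          return;         /* not present on this CPU */

                  cr4_set_bits(X86_CR4_PKE);
                  /* ... OSPKE-dependent setup continues here ... */
          }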
  15. 13 May 2016, 1 commit
    • KVM: halt_polling: provide a way to qualify wakeups during poll · 3491caf2
      Committed by Christian Borntraeger
      Some wakeups should not be considered a successful poll. For example, on
      s390, I/O interrupts are usually floating, which means that _ALL_ CPUs
      would be considered runnable - letting all vCPUs poll all the time for
      transaction-like workloads, even if one vCPU would be enough.
      This can result in huge CPU usage for large guests.
      This patch lets architectures provide a way to qualify wakeups as
      good or bad with regard to polling.
      
      For s390 the implementation will fence off halt polling for anything but
      known good, single-vCPU events. The s390 implementation for floating
      interrupts does a wakeup for one vCPU, but the interrupt will be delivered
      by whatever CPU checks first for a pending interrupt. We prefer the
      woken-up CPU by marking the poll of this CPU as a "good" poll.
      This code will also mark several other wakeup reasons like IPI or
      expired timers as "good". This will of course also mark some events as
      not successful. Since KVM on z always runs as a second-level hypervisor,
      we prefer not to poll unless we are really sure, though.
      
      This patch successfully limits the CPU usage for cases like uperf 1byte
      transactional ping pong workload or wakeup heavy workload like OLTP
      while still providing a proper speedup.
      
      This also introduced a new vcpu stat "halt_poll_no_tuning" that marks
      wakeups that are considered not good for polling.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
      Cc: David Matlack <dmatlack@google.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      [Rename config symbol. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3491caf2
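      A hedged sketch of the generic side of the mechanism; the config symbol
      and flag names approximate the patch and should be treated as
      assumptions:

          #ifdef CONFIG_HAVE_KVM_INVALID_WAKEUPS
          static inline bool vcpu_valid_wakeup(struct kvm_vcpu *vcpu)
          {
                  return vcpu->valid_wakeup;  /* set by the arch at wakeup time */
          }
          #else
          static inline bool vcpu_valid_wakeup(struct kvm_vcpu *vcpu)
          {
                  return true;                /* default: every wakeup is a good poll */
          }
          #endif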
  16. 12 May 2016, 3 commits
    • x86/cpu: Add detection of AMD RAS Capabilities · 71faad43
      Committed by Yazen Ghannam
      Add a new CPUID leaf to hold the contents of CPUID 0x80000007_EBX (RasCap).
      
      Define bits that are currently in use:
      
       Bit 0: McaOverflowRecov
       Bit 1: SUCCOR
       Bit 3: ScalableMca
      Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
      [ Shorten comment. ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1462971509-3856-5-git-send-email-bp@alien8.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      71faad43
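      The new leaf is trivial to consume; a short sketch following the bit
      layout listed above (variable names are illustrative):

          u32 ebx = cpuid_ebx(0x80000007);

          bool mca_overflow_recov = ebx & BIT(0);   /* McaOverflowRecov */
          bool succor             = ebx & BIT(1);   /* SUCCOR */
          bool scalable_mca       = ebx & BIT(3);   /* ScalableMca */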
    • x86/mce/AMD: Log Deferred Errors using SMCA MCA_DE{STAT,ADDR} registers · 34102009
      Committed by Yazen Ghannam
      Scalable MCA provides new registers for all banks for logging deferred
      errors: MCA_DESTAT and MCA_DEADDR. Deferred errors are always logged to
      these registers.
      
      Update the AMD deferred error handler to use these registers, if
      available.
      Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
      [ Sanity-check __log_error() args, massage a bit. ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1462971509-3856-2-git-send-email-bp@alien8.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      34102009
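      A hedged sketch of the handler change; the SMCA MSR accessor macros are
      assumptions standing in for the real register definitions:

          u64 status, addr = 0;

          /* on SMCA systems, deferred errors live in MCA_DE{STAT,ADDR} */
          if (mce_flags.smca)
                  rdmsrl(MSR_AMD64_SMCA_MCx_DESTAT(bank), status);
          else
                  rdmsrl(MSR_IA32_MCx_STATUS(bank), status);

          if (mce_flags.smca && (status & MCI_STATUS_ADDRV))
                  rdmsrl(MSR_AMD64_SMCA_MCx_DEADDR(bank), addr);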
    • x86/extable: ensure entries are swapped completely when sorting · 50c73890
      Committed by Mathias Krause
      The x86 exception table sorting was changed in commit 29934b0f
      ("x86/extable: use generic search and sort routines") to use the arch
      independent code in lib/extable.c.  However, the patch was mangled
      somehow on its way into the kernel from the last version posted at [1].
      The committed version kind of attempted to incorporate the changes of
      commit 548acf19 ("x86/mm: Expand the exception table logic to allow
      new handling options") as in _completely_ _ignoring_ the x86 specific
      'handler' member of struct exception_table_entry.  This effectively
      broke the sorting, as entries will now only be partly swapped.
      
      Fortunately, the x86 Kconfig selects BUILDTIME_EXTABLE_SORT, so the
      exception table doesn't need to be sorted at runtime. However, in case
      that ever changes, we better not break the exception table sorting just
      because of that.
      
      [ Ard Biesheuvel points out that BUILDTIME_EXTABLE_SORT applies to the
        core image only, but we still rely on the sorting routines for modules
        in that case - Linus ]
      
      Fix this by providing a swap_ex_entry_fixup() macro that takes care of
      the 'handler' member.
      
      [1] https://lkml.org/lkml/2016/1/27/232
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Fixes: 29934b0f ("x86/extable: use generic search and sort routines")
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50c73890
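      Because exception table entries hold relative offsets, swapping two
      entries must re-bias every field by the distance the entry moved; the
      fix is to give 'handler' the same treatment as 'insn' and 'fixup'. A
      sketch of such a macro (close to, but not guaranteed to be, the literal
      patch):

          #define swap_ex_entry_fixup(a, b, tmp, delta)           \
          do {                                                    \
                  (a)->fixup   = (b)->fixup   + (delta);          \
                  (b)->fixup   = (tmp).fixup  - (delta);          \
                  (a)->handler = (b)->handler + (delta);          \
                  (b)->handler = (tmp).handler - (delta);         \
          } while (0)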
  17. 07 May 2016, 3 commits
    • x86/KASLR: Build identity mappings on demand · 3a94707d
      Committed by Kees Cook
      Currently KASLR only supports relocation in a small physical range (from
      16M to 1G), due to using the initial kernel page table identity mapping.
      To support ranges above this, we need to have an identity mapping for the
      desired memory range before we can decompress (and later run) the kernel.
      
      32-bit kernels already have the needed identity mapping. This patch adds
      identity mappings for the needed memory ranges on 64-bit kernels. This
      happens in two possible boot paths:
      
      If loaded via startup_32(), we need to set up the needed identity map.
      
      If loaded from a 64-bit bootloader, the bootloader will have already
      set up an identity mapping, and we'll start via the compressed kernel's
      startup_64(). In this case, the bootloader's page tables need to be
      avoided while selecting the new uncompressed kernel location. If not,
      the decompressor could overwrite them during decompression.
      
      To accomplish this, we could walk the pagetable and find every page
      that is used, and add them to mem_avoid, but this needs extra code and
      will require increasing the size of the mem_avoid array.
      
      Instead, we can create a new set of page tables for our own identity
      mapping. The pages for the new page tables will come from the
      _pgtable section of the compressed kernel, which means they are
      already covered by the mem_avoid array. To do this, we reuse the code
      from the uncompressed kernel's identity mapping routines.
      
      The _pgtable will be shared by both the 32-bit and 64-bit paths to reduce
      init_size, as now the compressed kernel's _rodata to _end will contribute
      to init_size.
      
      To handle the possible mappings, we need to increase the existing page
      table buffer size:
      
      When booting via startup_64(), we need to cover the old VO, params,
      cmdline and uncompressed kernel. In an extreme case we could have them
      all beyond the 512G boundary, which needs (2+2)*4 pages with 2M mappings.
      And we'll need 2 more for the first 2M of VGA RAM, plus one for the
      level-4 page table. This gets us to 19 pages total.
      
      When booting via startup_32(), KASLR could move the uncompressed kernel
      above 4G, so we need to create extra identity mappings, which should only
      need (2+2) pages at most when it is beyond the 512G boundary. So 19
      pages is sufficient for this case as well.
      
      The resulting BOOT_*PGT_SIZE defines use the "_SIZE" suffix on their
      names to maintain logical consistency with the existing BOOT_HEAP_SIZE
      and BOOT_STACK_SIZE defines.
      
      This patch is based on earlier patches from Yinghai Lu and Baoquan He.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: kernel-hardening@lists.openwall.com
      Cc: lasse.collin@tukaani.org
      Link: http://lkml.kernel.org/r/1462572095-11754-4-git-send-email-keescook@chromium.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3a94707d
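      A hedged sketch of the decompressor side, reusing
      kernel_ident_mapping_init() with page-table pages handed out from the
      compressed kernel's _pgtable area (names approximate the patch):

          /* arch/x86/boot/compressed/ (illustrative) */
          static struct x86_mapping_info mapping_info = {
                  .alloc_pgt_page = alloc_pgt_page,  /* carves pages out of _pgtable */
                  .context        = &pgt_data,
                  .pmd_flag       = __PAGE_KERNEL_LARGE_EXEC,
          };

          void add_identity_map(unsigned long start, unsigned long size)
          {
                  unsigned long end = start + size;

                  /* round to 2M so PMD-level mappings cover the whole range */
                  start = round_down(start, PMD_SIZE);
                  end   = round_up(end, PMD_SIZE);
                  if (start >= end)
                          return;

                  kernel_ident_mapping_init(&mapping_info, (pgd_t *)level4p,
                                            start, end);
          }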
    • x86/boot: Split out kernel_ident_mapping_init() · cf4fb15b
      Committed by Yinghai Lu
      In order to support on-demand page table creation when moving the
      kernel for KASLR, we need to use kernel_ident_mapping_init() in the
      decompression code.
      
      This splits it out into its own file for use outside of init_64.c.
      Additionally, checking for __pa/__va defines is added since they
      need to be overridden in the decompression code.
      
      [kees: rewrote changelog]
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: kernel-hardening@lists.openwall.com
      Cc: lasse.collin@tukaani.org
      Link: http://lkml.kernel.org/r/1462572095-11754-3-git-send-email-keescook@chromium.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cf4fb15b
    • x86/boot: Clean up indenting for asm/boot.h · 8665e6ff
      Committed by Kees Cook
      Before adding more defines to asm/boot.h, this cleans up the existing
      indenting for readability.
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: kernel-hardening@lists.openwall.com
      Cc: lasse.collin@tukaani.org
      Link: http://lkml.kernel.org/r/1462572095-11754-2-git-send-email-keescook@chromium.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8665e6ff
  18. 06 May 2016, 1 commit
  19. 05 May 2016, 4 commits