1. 09 9月, 2016 10 次提交
    • L
      x86/efi: Remove unused find_bits() function · 15cf7cae
      Lukas Wunner 提交于
      Left behind by commit fc372064 ("efi/libstub: Move Graphics Output
      Protocol handling to generic code").
      Signed-off-by: NLukas Wunner <lukas@wunner.de>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk>
      15cf7cae
    • A
      x86/efi: Map in physical addresses in efi_map_region_fixed · 0513fe1d
      Alex Thorlton 提交于
      This is a simple change to add in the physical mappings as well as the
      virtual mappings in efi_map_region_fixed.  The motivation here is to
      get access to EFI runtime code that is only available via the 1:1
      mappings on a kexec'd kernel.
      
      The added call is essentially the kexec analog of the first __map_region
      that Boris put in efi_map_region in commit d2f7cbe7 ("x86/efi:
      Runtime services virtual mapping").
      Signed-off-by: NAlex Thorlton <athorlton@sgi.com>
      Cc: Russ Anderson <rja@sgi.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Mike Travis <travis@sgi.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk>
      0513fe1d
    • C
      x86/efi: Initialize status to ensure garbage is not returned on small size · ac0e94b6
      Colin Ian King 提交于
      Although very unlikey, if size is too small or zero, then we end up with
      status not being set and returning garbage. Instead, initializing status to
      EFI_INVALID_PARAMETER to indicate that size is invalid in the calls to
      setup_uga32 and setup_uga64.
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk>
      ac0e94b6
    • M
      x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data · 4bc9f92e
      Matt Fleming 提交于
      efi_mem_reserve() allows us to permanently mark EFI boot services
      regions as reserved, which means we no longer need to copy the image
      data out and into a separate buffer.
      
      Leaving the data in the original boot services region has the added
      benefit that BGRT images can now be passed across kexec reboot.
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      Tested-by: Dave Young <dyoung@redhat.com> [kexec/kdump]
      Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [arm]
      Acked-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Leif Lindholm <leif.lindholm@linaro.org>
      Cc: Peter Jones <pjones@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Môshe van der Sterre <me@moshe.nl>
      Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk>
      4bc9f92e
    • M
      efi/runtime-map: Use efi.memmap directly instead of a copy · 31ce8cc6
      Matt Fleming 提交于
      Now that efi.memmap is available all of the time there's no need to
      allocate and build a separate copy of the EFI memory map.
      
      Furthermore, efi.memmap contains boot services regions but only those
      regions that have been reserved via efi_mem_reserve(). Using
      efi.memmap allows us to pass boot services across kexec reboot so that
      the ESRT and BGRT drivers will now work.
      
      Tested-by: Dave Young <dyoung@redhat.com> [kexec/kdump]
      Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [arm]
      Acked-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Leif Lindholm <leif.lindholm@linaro.org>
      Cc: Peter Jones <pjones@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk>
      31ce8cc6
    • M
      efi: Allow drivers to reserve boot services forever · 816e7612
      Matt Fleming 提交于
      Today, it is not possible for drivers to reserve EFI boot services for
      access after efi_free_boot_services() has been called on x86. For
      ARM/arm64 it can be done simply by calling memblock_reserve().
      
      Having this ability for all three architectures is desirable for a
      couple of reasons,
      
        1) It saves drivers copying data out of those regions
        2) kexec reboot can now make use of things like ESRT
      
      Instead of using the standard memblock_reserve() which is insufficient
      to reserve the region on x86 (see efi_reserve_boot_services()), a new
      API is introduced in this patch; efi_mem_reserve().
      
      efi.memmap now always represents which EFI memory regions are
      available. On x86 the EFI boot services regions that have not been
      reserved via efi_mem_reserve() will be removed from efi.memmap during
      efi_free_boot_services().
      
      This has implications for kexec, since it is not possible for a newly
      kexec'd kernel to access the same boot services regions that the
      initial boot kernel had access to unless they are reserved by every
      kexec kernel in the chain.
      
      Tested-by: Dave Young <dyoung@redhat.com> [kexec/kdump]
      Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [arm]
      Acked-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Leif Lindholm <leif.lindholm@linaro.org>
      Cc: Peter Jones <pjones@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk>
      816e7612
    • M
      efi: Add efi_memmap_init_late() for permanent EFI memmap · dca0f971
      Matt Fleming 提交于
      Drivers need a way to access the EFI memory map at runtime. ARM and
      arm64 currently provide this by remapping the EFI memory map into the
      vmalloc space before setting up the EFI virtual mappings.
      
      x86 does not provide this functionality which has resulted in the code
      in efi_mem_desc_lookup() where it will manually map individual EFI
      memmap entries if the memmap has already been torn down on x86,
      
        /*
         * If a driver calls this after efi_free_boot_services,
         * ->map will be NULL, and the target may also not be mapped.
         * So just always get our own virtual map on the CPU.
         *
         */
        md = early_memremap(p, sizeof (*md));
      
      There isn't a good reason for not providing a permanent EFI memory map
      for runtime queries, especially since the EFI regions are not mapped
      into the standard kernel page tables.
      
      Tested-by: Dave Young <dyoung@redhat.com> [kexec/kdump]
      Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [arm]
      Acked-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Leif Lindholm <leif.lindholm@linaro.org>
      Cc: Peter Jones <pjones@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk>
      dca0f971
    • M
      efi: Refactor efi_memmap_init_early() into arch-neutral code · 9479c7ce
      Matt Fleming 提交于
      Every EFI architecture apart from ia64 needs to setup the EFI memory
      map at efi.memmap, and the code for doing that is essentially the same
      across all implementations. Therefore, it makes sense to factor this
      out into the common code under drivers/firmware/efi/.
      
      The only slight variation is the data structure out of which we pull
      the initial memory map information, such as physical address, memory
      descriptor size and version, etc. We can address this by passing a
      generic data structure (struct efi_memory_map_data) as the argument to
      efi_memmap_init_early() which contains the minimum info required for
      initialising the memory map.
      
      In the process, this patch also fixes a few undesirable implementation
      differences:
      
       - ARM and arm64 were failing to clear the EFI_MEMMAP bit when
         unmapping the early EFI memory map. EFI_MEMMAP indicates whether
         the EFI memory map is mapped (not the regions contained within) and
         can be traversed.  It's more correct to set the bit as soon as we
         memremap() the passed in EFI memmap.
      
       - Rename efi_unmmap_memmap() to efi_memmap_unmap() to adhere to the
         regular naming scheme.
      
      This patch also uses a read-write mapping for the memory map instead
      of the read-only mapping currently used on ARM and arm64. x86 needs
      the ability to update the memory map in-place when assigning virtual
      addresses to regions (efi_map_region()) and tagging regions when
      reserving boot services (efi_reserve_boot_services()).
      
      There's no way for the generic fake_mem code to know which mapping to
      use without introducing some arch-specific constant/hook, so just use
      read-write since read-only is of dubious value for the EFI memory map.
      
      Tested-by: Dave Young <dyoung@redhat.com> [kexec/kdump]
      Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [arm]
      Acked-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Leif Lindholm <leif.lindholm@linaro.org>
      Cc: Peter Jones <pjones@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk>
      9479c7ce
    • M
      x86/efi: Consolidate region mapping logic · ab72a27d
      Matt Fleming 提交于
      EFI regions are currently mapped in two separate places. The bulk of
      the work is done in efi_map_regions() but when CONFIG_EFI_MIXED is
      enabled the additional regions that are required when operating in
      mixed mode are mapping in efi_setup_page_tables().
      
      Pull everything into efi_map_regions() and refactor the test for
      which regions should be mapped into a should_map_region() function.
      Generously sprinkle comments to clarify the different cases.
      Acked-by: NBorislav Petkov <bp@suse.de>
      Tested-by: Dave Young <dyoung@redhat.com> [kexec/kdump]
      Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [arm]
      Acked-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk>
      ab72a27d
    • M
      x86/efi: Test for EFI_MEMMAP functionality when iterating EFI memmap · 4971531a
      Matt Fleming 提交于
      Both efi_find_mirror() and efi_fake_memmap() really want to know
      whether the EFI memory map is available, not just whether the machine
      was booted using EFI. efi_fake_memmap() even has a check for
      EFI_MEMMAP at the start of the function.
      
      Since we've already got other code that has this dependency, merge
      everything under one if() conditional, and remove the now superfluous
      check from efi_fake_memmap().
      
      Tested-by: Dave Young <dyoung@redhat.com> [kexec/kdump]
      Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [arm]
      Acked-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk>
      4971531a
  2. 03 9月, 2016 2 次提交
    • E
      x86/AMD: Apply erratum 665 on machines without a BIOS fix · d1992996
      Emanuel Czirai 提交于
      AMD F12h machines have an erratum which can cause DIV/IDIV to behave
      unpredictably. The workaround is to set MSRC001_1029[31] but sometimes
      there is no BIOS update containing that workaround so let's do it
      ourselves unconditionally. It is simple enough.
      
      [ Borislav: Wrote commit message. ]
      Signed-off-by: NEmanuel Czirai <icanrealizeum@gmail.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Yaowu Xu <yaowu@google.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20160902053550.18097-1-bp@alien8.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      d1992996
    • S
      x86/paravirt: Do not trace _paravirt_ident_*() functions · 15301a57
      Steven Rostedt 提交于
      Łukasz Daniluk reported that on a RHEL kernel that his machine would lock up
      after enabling function tracer. I asked him to bisect the functions within
      available_filter_functions, which he did and it came down to three:
      
        _paravirt_nop(), _paravirt_ident_32() and _paravirt_ident_64()
      
      It was found that this is only an issue when noreplace-paravirt is added
      to the kernel command line.
      
      This means that those functions are most likely called within critical
      sections of the funtion tracer, and must not be traced.
      
      In newer kenels _paravirt_nop() is defined within gcc asm(), and is no
      longer an issue.  But both _paravirt_ident_{32,64}() causes the
      following splat when they are traced:
      
       mm/pgtable-generic.c:33: bad pmd ffff8800d2435150(0000000001d00054)
       mm/pgtable-generic.c:33: bad pmd ffff8800d3624190(0000000001d00070)
       mm/pgtable-generic.c:33: bad pmd ffff8800d36a5110(0000000001d00054)
       mm/pgtable-generic.c:33: bad pmd ffff880118eb1450(0000000001d00054)
       NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd-journal:469]
       Modules linked in: e1000e
       CPU: 2 PID: 469 Comm: systemd-journal Not tainted 4.6.0-rc4-test+ #513
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
       task: ffff880118f740c0 ti: ffff8800d4aec000 task.ti: ffff8800d4aec000
       RIP: 0010:[<ffffffff81134148>]  [<ffffffff81134148>] queued_spin_lock_slowpath+0x118/0x1a0
       RSP: 0018:ffff8800d4aefb90  EFLAGS: 00000246
       RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88011eb16d40
       RDX: ffffffff82485760 RSI: 000000001f288820 RDI: ffffea0000008030
       RBP: ffff8800d4aefb90 R08: 00000000000c0000 R09: 0000000000000000
       R10: ffffffff821c8e0e R11: 0000000000000000 R12: ffff880000200fb8
       R13: 00007f7a4e3f7000 R14: ffffea000303f600 R15: ffff8800d4b562e0
       FS:  00007f7a4e3d7840(0000) GS:ffff88011eb00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f7a4e3f7000 CR3: 00000000d3e71000 CR4: 00000000001406e0
       Call Trace:
         _raw_spin_lock+0x27/0x30
         handle_pte_fault+0x13db/0x16b0
         handle_mm_fault+0x312/0x670
         __do_page_fault+0x1b1/0x4e0
         do_page_fault+0x22/0x30
         page_fault+0x28/0x30
         __vfs_read+0x28/0xe0
         vfs_read+0x86/0x130
         SyS_read+0x46/0xa0
         entry_SYSCALL_64_fastpath+0x1e/0xa8
       Code: 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 6d 01 00 48 03 14 c5 80 6a 5d 82 48 89 0a 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85> c0 74 f7 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02 f3 90 8b
      Reported-by: NŁukasz Daniluk <lukasz.daniluk@intel.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      15301a57
  3. 02 9月, 2016 1 次提交
  4. 31 8月, 2016 1 次提交
    • J
      mm/usercopy: get rid of CONFIG_DEBUG_STRICT_USER_COPY_CHECKS · 0d025d27
      Josh Poimboeuf 提交于
      There are three usercopy warnings which are currently being silenced for
      gcc 4.6 and newer:
      
      1) "copy_from_user() buffer size is too small" compile warning/error
      
         This is a static warning which happens when object size and copy size
         are both const, and copy size > object size.  I didn't see any false
         positives for this one.  So the function warning attribute seems to
         be working fine here.
      
         Note this scenario is always a bug and so I think it should be
         changed to *always* be an error, regardless of
         CONFIG_DEBUG_STRICT_USER_COPY_CHECKS.
      
      2) "copy_from_user() buffer size is not provably correct" compile warning
      
         This is another static warning which happens when I enable
         __compiletime_object_size() for new compilers (and
         CONFIG_DEBUG_STRICT_USER_COPY_CHECKS).  It happens when object size
         is const, but copy size is *not*.  In this case there's no way to
         compare the two at build time, so it gives the warning.  (Note the
         warning is a byproduct of the fact that gcc has no way of knowing
         whether the overflow function will be called, so the call isn't dead
         code and the warning attribute is activated.)
      
         So this warning seems to only indicate "this is an unusual pattern,
         maybe you should check it out" rather than "this is a bug".
      
         I get 102(!) of these warnings with allyesconfig and the
         __compiletime_object_size() gcc check removed.  I don't know if there
         are any real bugs hiding in there, but from looking at a small
         sample, I didn't see any.  According to Kees, it does sometimes find
         real bugs.  But the false positive rate seems high.
      
      3) "Buffer overflow detected" runtime warning
      
         This is a runtime warning where object size is const, and copy size >
         object size.
      
      All three warnings (both static and runtime) were completely disabled
      for gcc 4.6 with the following commit:
      
        2fb0815c ("gcc4: disable __compiletime_object_size for GCC 4.6+")
      
      That commit mistakenly assumed that the false positives were caused by a
      gcc bug in __compiletime_object_size().  But in fact,
      __compiletime_object_size() seems to be working fine.  The false
      positives were instead triggered by #2 above.  (Though I don't have an
      explanation for why the warnings supposedly only started showing up in
      gcc 4.6.)
      
      So remove warning #2 to get rid of all the false positives, and re-enable
      warnings #1 and #3 by reverting the above commit.
      
      Furthermore, since #1 is a real bug which is detected at compile time,
      upgrade it to always be an error.
      
      Having done all that, CONFIG_DEBUG_STRICT_USER_COPY_CHECKS is no longer
      needed.
      Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Nilay Vaish <nilayvaish@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0d025d27
  5. 27 8月, 2016 1 次提交
  6. 25 8月, 2016 1 次提交
  7. 24 8月, 2016 2 次提交
  8. 18 8月, 2016 5 次提交
    • P
      kvm: nVMX: fix nested tsc scaling · c95ba92a
      Peter Feiner 提交于
      When the host supported TSC scaling, L2 would use a TSC multiplier of
      0, which causes a VM entry failure. Now L2's TSC uses the same
      multiplier as L1.
      Signed-off-by: NPeter Feiner <pfeiner@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c95ba92a
    • R
      KVM: nVMX: postpone VMCS changes on MSR_IA32_APICBASE write · dccbfcf5
      Radim Krčmář 提交于
      If vmcs12 does not intercept APIC_BASE writes, then KVM will handle the
      write with vmcs02 as the current VMCS.
      This will incorrectly apply modifications intended for vmcs01 to vmcs02
      and L2 can use it to gain access to L0's x2APIC registers by disabling
      virtualized x2APIC while using msr bitmap that assumes enabled.
      
      Postpone execution of vmx_set_virtual_x2apic_mode until vmcs01 is the
      current VMCS.  An alternative solution would temporarily make vmcs01 the
      current VMCS, but it requires more care.
      
      Fixes: 8d14695f ("x86, apicv: add virtual x2apic support")
      Reported-by: NJim Mattson <jmattson@google.com>
      Reviewed-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      dccbfcf5
    • R
      KVM: nVMX: fix msr bitmaps to prevent L2 from accessing L0 x2APIC · d048c098
      Radim Krčmář 提交于
      msr bitmap can be used to avoid a VM exit (interception) on guest MSR
      accesses.  In some configurations of VMX controls, the guest can even
      directly access host's x2APIC MSRs.  See SDM 29.5 VIRTUALIZING MSR-BASED
      APIC ACCESSES.
      
      L2 could read all L0's x2APIC MSRs and write TPR, EOI, and SELF_IPI.
      To do so, L1 would first trick KVM to disable all possible interceptions
      by enabling APICv features and then would turn those features off;
      nested_vmx_merge_msr_bitmap() only disabled interceptions, so VMX would
      not intercept previously enabled MSRs even though they were not safe
      with the new configuration.
      
      Correctly re-enabling interceptions is not enough as a second bug would
      still allow L1+L2 to access host's MSRs: msr bitmap was shared for all
      VMCSs, so L1 could trigger a race to get the desired combination of msr
      bitmap and VMX controls.
      
      This fix allocates a msr bitmap for every L1 VCPU, allows only safe
      x2APIC MSRs from L1's msr bitmap, and disables msr bitmaps if they would
      have to intercept everything anyway.
      
      Fixes: 3af18d9c ("KVM: nVMX: Prepare for using hardware MSR bitmap")
      Reported-by: NJim Mattson <jmattson@google.com>
      Suggested-by: NWincy Van <fanwenyi0529@gmail.com>
      Reviewed-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      d048c098
    • J
      x86/smp: Fix __max_logical_packages value setup · 7b0501b1
      Jiri Olsa 提交于
      Frank reported kernel panic when he disabled several cores in BIOS
      via following option:
      
        Core Disable Bitmap(Hex)   [0]
      
      with number 0xFFE, which leaves 16 CPUs in system (out of 48).
      
      The kernel panic below goes along with following messages:
      
       smpboot: Max logical packages: 2^M
       smpboot: APIC(0) Converting physical 0 to logical package 0^M
       smpboot: APIC(20) Converting physical 1 to logical package 1^M
       smpboot: APIC(40) Package 2 exceeds logical package map^M
       smpboot: CPU 8 APICId 40 disabled^M
       smpboot: APIC(60) Package 3 exceeds logical package map^M
       smpboot: CPU 12 APICId 60 disabled^M
       ...
       general protection fault: 0000 [#1] SMP^M
       Modules linked in:^M
       CPU: 15 PID: 1 Comm: swapper/0 Not tainted 4.7.0-rc5+ #1^M
       Hardware name: SGI UV300/UV300, BIOS SGI UV 300 series BIOS 05/25/2016^M
       task: ffff8801673e0000 ti: ffff8801673ac000 task.ti: ffff8801673ac000^M
       RIP: 0010:[<ffffffff81014d54>]  [<ffffffff81014d54>] uncore_change_context+0xd4/0x180^M
       ...
        [<ffffffff810158ac>] uncore_event_init_cpu+0x6c/0x70^M
        [<ffffffff81d8c91c>] intel_uncore_init+0x1c2/0x2dd^M
        [<ffffffff81d8c75a>] ? uncore_cpu_setup+0x17/0x17^M
        [<ffffffff81002190>] do_one_initcall+0x50/0x190^M
        [<ffffffff810ab193>] ? parse_args+0x293/0x480^M
        [<ffffffff81d87365>] kernel_init_freeable+0x1a5/0x249^M
        [<ffffffff81d86a35>] ? set_debug_rodata+0x12/0x12^M
        [<ffffffff816dc19e>] kernel_init+0xe/0x110^M
        [<ffffffff816e93bf>] ret_from_fork+0x1f/0x40^M
        [<ffffffff816dc190>] ? rest_init+0x80/0x80^M
      
      The reason for the panic is wrong value of __max_logical_packages,
      which lets logical_package_map uninitialized and the uncore code
      relying on this map being properly initialized (maybe we should
      add some safety checks there as well).
      
      The __max_logical_packages is computed as:
      
        DIV_ROUND_UP(total_cpus, ncpus);
        - ncpus being number of cores
      
      With above BIOS setup we get total_cpus == 16 which set
      __max_logical_packages to 2 (ncpus is 12).
      
      Once topology_update_package_map processes CPU with logical
      pkg over 2 we display above messages and fail to initialize
      the physical_to_logical_pkg map, which makes the uncore code
      crash.
      
      The fix is to remove logical_package_map bitmap completely
      and keep and update the logical_packages number instead.
      
      After we enumerate all the present CPUs, we check if the
      enumerated logical packages count is within its computed
      maximum from BIOS data.
      
      If it's not the case, we set this maximum to the new enumerated
      value and freeze any new addition of logical packages.
      
      The freeze is because lot of init code like uncore/rapl/cqm
      depends on having maximum logical package value set to allocate
      their data, so we can't change it later on.
      
      Prarit Bhargava tested the patch and confirms that it solves
      the problem:
      
        From dmidecode:
                Core Count: 24
                Core Enabled: 24
                Thread Count: 48
      
      Orig kernel boot log:
      
       [    0.464981] smpboot: Max logical packages: 19
       [    0.469861] smpboot: APIC(0) Converting physical 0 to logical package 0
       [    0.477261] smpboot: APIC(40) Converting physical 1 to logical package 1
       [    0.484760] smpboot: APIC(80) Converting physical 2 to logical package 2
       [    0.492258] smpboot: APIC(c0) Converting physical 3 to logical package 3
      
      1.  nr_cpus=8, should stop enumerating in package 0:
      
       [    0.533664] smpboot: APIC(0) Converting physical 0 to logical package 0
       [    0.539596] smpboot: Max logical packages: 19
      
      2.  max_cpus=8, should still enumerate all packages:
      
       [    0.526494] smpboot: APIC(0) Converting physical 0 to logical package 0
       [    0.532428] smpboot: APIC(40) Converting physical 1 to logical package 1
       [    0.538456] smpboot: APIC(80) Converting physical 2 to logical package 2
       [    0.544486] smpboot: APIC(c0) Converting physical 3 to logical package 3
       [    0.550524] smpboot: Max logical packages: 19
      
      3.  nr_cpus=49 ( 2 socket + 1 core on 3rd socket), should stop enumerating in
          package 2:
      
       [    0.521378] smpboot: APIC(0) Converting physical 0 to logical package 0
       [    0.527314] smpboot: APIC(40) Converting physical 1 to logical package 1
       [    0.533345] smpboot: APIC(80) Converting physical 2 to logical package 2
       [    0.539368] smpboot: Max logical packages: 19
      
      4.  maxcpus=49, should still enumerate all packages:
      
       [    0.525591] smpboot: APIC(0) Converting physical 0 to logical package 0
       [    0.531525] smpboot: APIC(40) Converting physical 1 to logical package 1
       [    0.537547] smpboot: APIC(80) Converting physical 2 to logical package 2
       [    0.543579] smpboot: APIC(c0) Converting physical 3 to logical package 3
       [    0.549624] smpboot: Max logical packages: 19
      
      5.  kdump (nr_cpus=1) works as well.
      Reported-by: NFrank Ramsay <framsay@redhat.com>
      Tested-by: NPrarit Bhargava <prarit@redhat.com>
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Reviewed-by: NPrarit Bhargava <prarit@redhat.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160815101700.GA30090@kravaSigned-off-by: NIngo Molnar <mingo@kernel.org>
      7b0501b1
    • B
      x86/microcode/AMD: Fix initrd loading with CONFIG_RANDOMIZE_MEMORY=y · 88b2f634
      Borislav Petkov 提交于
      Similar to:
      
        efaad554 ("x86/microcode/intel: Fix initrd loading with CONFIG_RANDOMIZE_MEMORY=y")
      
      ... fix microcode loading from the initrd on AMD by adding the
      randomization offset to the microcode patch container within the initrd.
      Reported-and-tested-by: NBrian Gerst <brgerst@gmail.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-tip-commits@vger.kernel.org
      Link: http://lkml.kernel.org/r/20160817113314.GA19221@nazgul.tnicSigned-off-by: NIngo Molnar <mingo@kernel.org>
      88b2f634
  9. 16 8月, 2016 3 次提交
  10. 12 8月, 2016 3 次提交
    • K
      perf/x86/intel/uncore: Add enable_box for client MSR uncore · 95f3be79
      Kan Liang 提交于
      There are bug reports about miscounting uncore counters on some
      client machines like Sandybridge, Broadwell and Skylake. It is
      very likely to be observed on idle systems.
      
      This issue is caused by a hardware issue. PERF_GLOBAL_CTL could be
      cleared after Package C7, and nothing will be count.
      The related errata (HSD 158) could be found in:
      
        www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf
      
      This patch tries to work around this issue by re-enabling PERF_GLOBAL_CTL
      in ->enable_box(). The workaround does not cover all cases. It helps for new
      events after returning from C7. But it cannot prevent C7, it will still
      miscount if a counter is already active.
      
      There is no drawback in leaving it enabled, so it does not need
      disable_box() here.
      Signed-off-by: NKan Liang <kan.liang@intel.com>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1470925874-59943-1-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      95f3be79
    • K
      perf/x86/intel/uncore: Fix uncore num_counters · 10e9e7bd
      Kan Liang 提交于
      Some uncore boxes' num_counters value for Haswell server and
      Broadwell server are not correct (too large, off by one).
      
      This issue was found by comparing the code with the document. Although
      there is no bug report from users yet, accessing non-existent counters
      is dangerous and the behavior is undefined: it may cause miscounting or
      even crashes.
      
      This patch makes them consistent with the uncore document.
      Reported-by: NLukasz Odzioba <lukasz.odzioba@intel.com>
      Signed-off-by: NKan Liang <kan.liang@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1470925820-59847-1-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      10e9e7bd
    • D
      uprobes/x86: Fix RIP-relative handling of EVEX-encoded instructions · 68187872
      Denys Vlasenko 提交于
      Since instruction decoder now supports EVEX-encoded instructions, two fixes
      are needed to correctly handle them in uprobes.
      
      Extended bits for MODRM.rm field need to be sanitized just like we do it
      for VEX3, to avoid encoding wrong register for register-relative access.
      
      EVEX has _two_ extended bits: b and x. Theoretically, EVEX.x should be
      ignored by the CPU (since GPRs go only up to 15, not 31), but let's be
      paranoid here: proper encoding for register-relative access
      should have EVEX.x = 1.
      
      Secondly, we should fetch vex.vvvv for EVEX too.
      This is now super easy because instruction decoder populates
      vex_prefix.bytes[2] for all flavors of (e)vex encodings, even for VEX2.
      Signed-off-by: NDenys Vlasenko <dvlasenk@redhat.com>
      Acked-by: NMasami Hiramatsu <mhiramat@kernel.org>
      Acked-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jim Keniston <jkenisto@us.ibm.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Cc: <stable@vger.kernel.org> # v4.1+
      Fixes: 8a764a87 ("x86/asm/decoder: Create artificial 3rd byte for 2-byte VEX")
      Link: http://lkml.kernel.org/r/20160811154521.20469-1-dvlasenk@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      68187872
  11. 11 8月, 2016 10 次提交
    • S
      x86/apic/x2apic, smp/hotplug: Don't use before alloc in x2apic_cluster_probe() · d52c0569
      Sebastian Andrzej Siewior 提交于
      I made a mistake while converting the driver to the hotplug state
      machine and as a result x2apic_cluster_probe() was accessing
      cpus_in_cluster before allocating it.
      
      This patch fixes it by setting the cpumask after the allocation the
      memory succeeded.
      
      While at it, I marked two functions static which are only used within
      this file.
      Reported-by: NLaura Abbott <labbott@redhat.com>
      Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 6b2c2847 ("x86/x2apic: Convert to CPU hotplug state machine")
      Link: http://lkml.kernel.org/r/1470924515-9444-1-git-send-email-bigeasy@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d52c0569
    • A
      x86/platform/uv: Skip UV runtime services mapping in the efi_runtime_disabled case · f72075c9
      Alex Thorlton 提交于
      This problem has actually been in the UV code for a while, but we didn't
      catch it until recently, because we had been relying on EFI_OLD_MEMMAP
      to allow our systems to boot for a period of time.  We noticed the issue
      when trying to kexec a recent community kernel, where we hit this NULL
      pointer dereference in efi_sync_low_kernel_mappings():
      
       [    0.337515] BUG: unable to handle kernel NULL pointer dereference at 0000000000000880
       [    0.346276] IP: [<ffffffff8105df8d>] efi_sync_low_kernel_mappings+0x5d/0x1b0
      
      The problem doesn't show up with EFI_OLD_MEMMAP because we skip the
      chunk of setup_efi_state() that sets the efi_loader_signature for the
      kexec'd kernel.  When the kexec'd kernel boots, it won't set EFI_BOOT in
      setup_arch, so we completely avoid the bug.
      
      We always kexec with noefi on the command line, so this shouldn't be an
      issue, but since we're not actually checking for efi_runtime_disabled in
      uv_bios_init(), we end up trying to do EFI runtime callbacks when we
      shouldn't be. This patch just adds a check for efi_runtime_disabled in
      uv_bios_init() so that we don't map in uv_systab when runtime_disabled ==
      true.
      Signed-off-by: NAlex Thorlton <athorlton@sgi.com>
      Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Cc: <stable@vger.kernel.org> # v4.7
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Travis <travis@sgi.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russ Anderson <rja@sgi.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/1470912120-22831-2-git-send-email-matt@codeblueprint.co.ukSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f72075c9
    • A
      x86/efi: Allocate a trampoline if needed in efi_free_boot_services() · 5bc653b7
      Andy Lutomirski 提交于
      On my Dell XPS 13 9350 with firmware 1.4.4 and SGX on, if I boot
      Fedora 24's grub2-efi off a hard disk, my first 1MB of RAM looks
      like:
      
       efi: mem00: [Runtime Data       |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000000000-0x0000000000000fff] (0MB)
       efi: mem01: [Boot Data          |   |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000001000-0x0000000000027fff] (0MB)
       efi: mem02: [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000028000-0x0000000000029fff] (0MB)
       efi: mem03: [Reserved           |   |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x000000000002a000-0x000000000002bfff] (0MB)
       efi: mem04: [Runtime Data       |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x000000000002c000-0x000000000002cfff] (0MB)
       efi: mem05: [Loader Data        |   |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x000000000002d000-0x000000000002dfff] (0MB)
       efi: mem06: [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x000000000002e000-0x0000000000057fff] (0MB)
       efi: mem07: [Reserved           |   |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000058000-0x0000000000058fff] (0MB)
       efi: mem08: [Conventional Memory|   |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000059000-0x000000000009ffff] (0MB)
      
      My EBDA is at 0x2c000, which blocks off everything from 0x2c000 and
      up, and my trampoline is 0x6000 bytes (6 pages), so it doesn't fit
      in the loader data range at 0x28000.
      
      Without this patch, it panics due to a failure to allocate the
      trampoline.  With this patch, it works:
      
       [  +0.001744] Base memory trampoline at [ffff880000001000] 1000 size 24576
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Reviewed-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mario Limonciello <mario_limonciello@dell.com>
      Cc: Matt Fleming <mfleming@suse.de>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/998c77b3bf709f3dfed85cb30701ed1a5d8a438b.1470821230.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5bc653b7
    • A
      x86/boot: Rework reserve_real_mode() to allow multiple tries · 5ff3e2c3
      Andy Lutomirski 提交于
      If reserve_real_mode() fails, panicing immediately means we're
      doomed.  Make it safe to try more than once to allocate the
      trampoline:
      
       - Degrade a failure from panic() to pr_info().  (If we make it to
         setup_real_mode() without reserving the trampoline, we'll panic
         them.)
      
       - Factor out helpers so that platform code can supply a specific
         address to try.
      
       - Warn if reserve_real_mode() is called after we're done with the
         memblock allocator.  If that were to happen, we would behave
         unpredictably.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mario Limonciello <mario_limonciello@dell.com>
      Cc: Matt Fleming <mfleming@suse.de>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/876e383038f3e9971aa72fd20a4f5da05f9d193d.1470821230.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5ff3e2c3
    • A
      x86/boot: Defer setup_real_mode() to early_initcall time · d0de0f68
      Andy Lutomirski 提交于
      There's no need to run setup_real_mode() as early as we run it.
      Defer it to the same early_initcall that sets up the page
      permissions for the real mode code.
      
      This should be a code size reduction.  More importantly, it give us
      a longer window in which we can allocate the real mode trampoline.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mario Limonciello <mario_limonciello@dell.com>
      Cc: Matt Fleming <mfleming@suse.de>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/fd62f0da4f79357695e9bf3e365623736b05f119.1470821230.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d0de0f68
    • A
      x86/boot: Synchronize trampoline_cr4_features and mmu_cr4_features directly · 18bc7bd5
      Andy Lutomirski 提交于
      The initialization process for trampoline_cr4_features and
      mmu_cr4_features was confusing.  The intent is for mmu_cr4_features
      and *trampoline_cr4_features to stay in sync, but
      trampoline_cr4_features is NULL until setup_real_mode() runs.  The
      old code synchronized *trampoline_cr4_features *twice*, once in
      setup_real_mode() and once in setup_arch().  It also initialized
      mmu_cr4_features in setup_real_mode(), which causes the actual value
      of mmu_cr4_features to potentially depend on when setup_real_mode()
      is called.
      
      With this patch, mmu_cr4_features is initialized directly in
      setup_arch(), and *trampoline_cr4_features is synchronized to
      mmu_cr4_features when the trampoline is set up.
      
      After this patch, it should be safe to defer setup_real_mode().
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mario Limonciello <mario_limonciello@dell.com>
      Cc: Matt Fleming <mfleming@suse.de>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/d48a263f9912389b957dd495a7127b009259ffe0.1470821230.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      18bc7bd5
    • A
      x86/boot: Run reserve_bios_regions() after we initialize the memory map · 007b7560
      Andy Lutomirski 提交于
      reserve_bios_regions() is a quirk that reserves memory that we might
      otherwise think is available.  There's no need to run it so early,
      and running it before we have the memory map initialized with its
      non-quirky inputs makes it hard to make reserve_bios_regions() more
      intelligent.
      
      Move it right after we populate the memblock state.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mario Limonciello <mario_limonciello@dell.com>
      Cc: Matt Fleming <mfleming@suse.de>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/59f58618911005c799c6c9979ce6ae4881d907c2.1470821230.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      007b7560
    • A
      x86/irq: Do not substract irq_tlb_count from irq_call_count · 82ba4fac
      Aaron Lu 提交于
      Since commit:
      
        52aec330 ("x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR")
      
      the TLB remote shootdown is done through call function vector. That
      commit didn't take care of irq_tlb_count, which a later commit:
      
        fd0f5869 ("x86: Distinguish TLB shootdown interrupts from other functions call interrupts")
      
      ... tried to fix.
      
      The fix assumes every increase of irq_tlb_count has a corresponding
      increase of irq_call_count. So the irq_call_count is always bigger than
      irq_tlb_count and we could substract irq_tlb_count from irq_call_count.
      
      Unfortunately this is not true for the smp_call_function_single() case.
      The IPI is only sent if the target CPU's call_single_queue is empty when
      adding a csd into it in generic_exec_single. That means if two threads
      are both adding flush tlb csds to the same CPU's call_single_queue, only
      one IPI is sent. In other words, the irq_call_count is incremented by 1
      but irq_tlb_count is incremented by 2. Over time, irq_tlb_count will be
      bigger than irq_call_count and the substract will produce a very large
      irq_call_count value due to overflow.
      
      Considering that:
      
        1) it's not worth to send more IPIs for the sake of accurate counting of
           irq_call_count in generic_exec_single();
      
        2) it's not easy to tell if the call function interrupt is for TLB
           shootdown in __smp_call_function_single_interrupt().
      
      Not to exclude TLB shootdown from call function count seems to be the
      simplest fix and this patch just does that.
      
      This bug was found by LKP's cyclic performance regression tracking recently
      with the vm-scalability test suite. I have bisected to commit:
      
        3dec0ba0 ("mm/rmap: share the i_mmap_rwsem")
      
      This commit didn't do anything wrong but revealed the irq_call_count
      problem. IIUC, the commit makes rwc->remap_one in rmap_walk_file
      concurrent with multiple threads.  When remap_one is try_to_unmap_one(),
      then multiple threads could queue flush TLB to the same CPU but only
      one IPI will be sent.
      
      Since the commit was added in Linux v3.19, the counting problem only
      shows up from v3.19 onwards.
      Signed-off-by: NAaron Lu <aaron.lu@intel.com>
      Cc: Alex Shi <alex.shi@linaro.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
      Link: http://lkml.kernel.org/r/20160811074430.GA18163@aaronlu.sh.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      82ba4fac
    • D
      x86/mm: Fix swap entry comment and macro · ace7fab7
      Dave Hansen 提交于
      A recent patch changed the format of a swap PTE.
      
      The comment explaining the format of the swap PTE is wrong about
      the bits used for the swap type field.  Amusingly, the ASCII art
      and the patch description are correct, but the comment itself
      is wrong.
      
      As I was looking at this, I also noticed that the
      SWP_OFFSET_FIRST_BIT has an off-by-one error.  This does not
      really hurt anything.  It just wasted a bit of space in the PTE,
      giving us 2^59 bytes of addressable space in our swapfiles
      instead of 2^60.  But, it doesn't match with the comments, and it
      wastes a bit of space, so fix it.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Fixes: 00839ee3 ("x86/mm: Move swap offset/type up in PTE to work around erratum")
      Link: http://lkml.kernel.org/r/20160810172325.E56AD7DA@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ace7fab7
    • N
      x86/mm/kaslr: Fix -Wformat-security warning · 62d16b5a
      Nicolas Iooss 提交于
      debug_putstr() is used to output strings without using printf-like
      formatting but debug_putstr(v) is defined as early_printk(v) in
      arch/x86/lib/kaslr.c.
      
      This makes clang reports the following warning when building
      with -Wformat-security:
      
          arch/x86/lib/kaslr.c:57:15: warning: format string is not a string
          literal (potentially insecure) [-Wformat-security]
                  debug_putstr(purpose);
                               ^~~~~~~
      
      Fix this by using "%s" in early_printk().
      Signed-off-by: NNicolas Iooss <nicolas.iooss_linux@m4x.org>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160806102039.27221-1-nicolas.iooss_linux@m4x.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      62d16b5a
  12. 10 8月, 2016 1 次提交
    • D
      x86/mm/pkeys: Fix compact mode by removing protection keys' XSAVE buffer manipulation · b79daf85
      Dave Hansen 提交于
      The Memory Protection Keys "rights register" (PKRU) is
      XSAVE-managed, and is saved/restored along with the FPU state.
      
      When kernel code accesses FPU regsisters, it does a delicate
      dance with preempt.  Otherwise, the context switching code can
      get confused as to whether the most up-to-date state is in the
      registers themselves or in the XSAVE buffer.
      
      But, PKRU is not a normal FPU register.  Using it does not
      generate the normal device-not-available (#NM) exceptions which
      means we can not manage it lazily, and the kernel completley
      disallows using lazy mode when it is enabled.
      
      The dance with preempt *only* occurs when managing the FPU
      lazily.  Since we never manage PKRU lazily, we do not have to do
      the dance with preempt; we can access it directly.  Doing it
      this way saves a ton of complicated code (and is faster too).
      
      Further, the XSAVES reenabling failed to patch a bit of code
      in fpu__xfeature_set_state() the checked for compacted buffers.
      That check caused fpu__xfeature_set_state() to silently refuse to
      work when the kernel is using compacted XSAVE buffers.  This
      broke execute-only and future pkey_mprotect() support when using
      compact XSAVE buffers.
      
      But, removing fpu__xfeature_set_state() gets rid of this issue,
      in addition to the nice cleanup and speedup.
      
      This fixes the same thing as a fix that Sai posted:
      
        https://lkml.org/lkml/2016/7/25/637
      
      The fix that he posted is a much more obviously correct, but I
      think we should just do this instead.
      Reported-by: NSai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Ravi Shankar <ravi.v.shankar@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yu-Cheng Yu <yu-cheng.yu@intel.com>
      Link: http://lkml.kernel.org/r/20160727232040.7D060DAD@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b79daf85