1. July 20, 2018: 11 commits
    • x86/tsc: Redefine notsc to behave as tsc=unstable · fe9af81e
      Committed by Pavel Tatashin
      Currently, the notsc kernel parameter disables the use of the TSC by
      sched_clock(). However, this parameter does not prevent the kernel from
      accessing the TSC in other places.
      
      The only rationale for booting with notsc is to avoid timing
      discrepancies on multi-socket systems where the TSCs are not properly
      synchronized, and thus to exclude the TSC from timekeeping. But that
      prevents using the TSC as sched_clock() as well, which is not necessary,
      as the core sched_clock() implementation can handle non-synchronized
      TSC-based sched clocks just fine.
      
      However, there is another method to solve the above problem: booting with
      tsc=unstable parameter. This parameter allows sched_clock() to use TSC and
      just excludes it from timekeeping.
      
      So there is no real reason to keep notsc, but for compatibility reasons the
      parameter has to stay. Make it behave like 'tsc=unstable' instead.
      
      [ tglx: Massaged changelog ]
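      
      A minimal sketch of the resulting parameter handler (illustrative; the
      exact hunk in arch/x86/kernel/tsc.c may differ in detail):
      
        static int __init notsc_setup(char *str)
        {
                /* Behave like tsc=unstable: sched_clock() keeps using the
                 * TSC, but the TSC is excluded from timekeeping. */
                mark_tsc_unstable("boot parameter notsc");
                return 1;
        }
        __setup("notsc", notsc_setup);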
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: steven.sistare@oracle.com
      Cc: daniel.m.jordan@oracle.com
      Cc: linux@armlinux.org.uk
      Cc: schwidefsky@de.ibm.com
      Cc: heiko.carstens@de.ibm.com
      Cc: john.stultz@linaro.org
      Cc: sboyd@codeaurora.org
      Cc: hpa@zytor.com
      Cc: peterz@infradead.org
      Cc: prarit@redhat.com
      Cc: feng.tang@intel.com
      Cc: pmladek@suse.com
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: linux-s390@vger.kernel.org
      Cc: boris.ostrovsky@oracle.com
      Cc: jgross@suse.com
      Cc: pbonzini@redhat.com
      Link: https://lkml.kernel.org/r/20180719205545.16512-12-pasha.tatashin@oracle.com
    • x86/CPU: Call detect_nopl() only on the BSP · 9b3661cd
      Committed by Borislav Petkov
      Make detect_nopl() use the setup_* variants, have it be called only on
      the BSP, and drop the call in generic_identify() - X86_FEATURE_NOPL will
      be replicated to the APs through the forced caps. This helps to keep the
      mess at a manageable level.
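      
      Roughly, the resulting helper looks like this (a sketch assuming the
      usual 32-bit/64-bit split, not the verbatim patch):
      
        static void __init detect_nopl(void)
        {
        #ifdef CONFIG_X86_32
                setup_clear_cpu_cap(X86_FEATURE_NOPL);
        #else
                /* Forced caps are copied to the APs, so probing once on
                 * the BSP is enough. */
                setup_force_cpu_cap(X86_FEATURE_NOPL);
        #endif
        }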
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: steven.sistare@oracle.com
      Cc: daniel.m.jordan@oracle.com
      Cc: linux@armlinux.org.uk
      Cc: schwidefsky@de.ibm.com
      Cc: heiko.carstens@de.ibm.com
      Cc: john.stultz@linaro.org
      Cc: sboyd@codeaurora.org
      Cc: hpa@zytor.com
      Cc: douly.fnst@cn.fujitsu.com
      Cc: peterz@infradead.org
      Cc: prarit@redhat.com
      Cc: feng.tang@intel.com
      Cc: pmladek@suse.com
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: linux-s390@vger.kernel.org
      Cc: boris.ostrovsky@oracle.com
      Cc: jgross@suse.com
      Cc: pbonzini@redhat.com
      Link: https://lkml.kernel.org/r/20180719205545.16512-11-pasha.tatashin@oracle.com
    • x86/jump_label: Initialize static branching early · 8990cac6
      Committed by Pavel Tatashin
      Static branching is useful for runtime-patching branches that are used
      in hot paths but are infrequently changed.
      
      The x86 clock framework is one example that uses static branches to setup
      the best clock during boot and never changes it again.
      
      It is desirable to enable the TSC-based sched clock early to allow
      fine-grained boot time analysis early on. That requires the static
      branching functionality to be functional early as well.
      
      Static branching requires patching nop instructions, thus,
      arch_init_ideal_nops() must be called prior to jump_label_init().
      
      Do all the necessary steps to call arch_init_ideal_nops() right after
      early_cpu_init(), which also allows inserting a call to jump_label_init()
      right after that (see the sketch below). jump_label_init() will be called
      again from the generic init code, but it is protected against
      reinitialization already.
      
      [ tglx: Massaged changelog ]
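      
      The resulting call order in setup_arch(), sketched (elisions marked):
      
        void __init setup_arch(char **cmdline_p)
        {
                ...
                early_cpu_init();
                arch_init_ideal_nops(); /* pick the ideal NOPs for patching */
                jump_label_init();      /* static branching usable from here */
                ...
        }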
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: steven.sistare@oracle.com
      Cc: daniel.m.jordan@oracle.com
      Cc: linux@armlinux.org.uk
      Cc: schwidefsky@de.ibm.com
      Cc: heiko.carstens@de.ibm.com
      Cc: john.stultz@linaro.org
      Cc: sboyd@codeaurora.org
      Cc: hpa@zytor.com
      Cc: douly.fnst@cn.fujitsu.com
      Cc: prarit@redhat.com
      Cc: feng.tang@intel.com
      Cc: pmladek@suse.com
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: linux-s390@vger.kernel.org
      Cc: boris.ostrovsky@oracle.com
      Cc: jgross@suse.com
      Cc: pbonzini@redhat.com
      Link: https://lkml.kernel.org/r/20180719205545.16512-10-pasha.tatashin@oracle.com
    • x86/alternatives, jumplabel: Use text_poke_early() before mm_init() · 6fffacb3
      Committed by Pavel Tatashin
      It is supposed to be safe to modify static branches after
      jump_label_init(). But because the static-key modifying code eventually
      calls text_poke(), it can end up accessing a struct page which has not
      been initialized yet.
      
      Here is how to quickly reproduce the problem. Insert code like this
      into init/main.c:
      
      | +static DEFINE_STATIC_KEY_FALSE(__test);
      | asmlinkage __visible void __init start_kernel(void)
      | {
      |        char *command_line;
      |@@ -587,6 +609,10 @@ asmlinkage __visible void __init start_kernel(void)
      |        vfs_caches_init_early();
      |        sort_main_extable();
      |        trap_init();
      |+       {
      |+       static_branch_enable(&__test);
      |+       WARN_ON(!static_branch_likely(&__test));
      |+       }
      |        mm_init();
      
      The following warning shows up:
      WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:701 text_poke+0x20d/0x230
      RIP: 0010:text_poke+0x20d/0x230
      Call Trace:
       ? text_poke_bp+0x50/0xda
       ? arch_jump_label_transform+0x89/0xe0
       ? __jump_label_update+0x78/0xb0
       ? static_key_enable_cpuslocked+0x4d/0x80
       ? static_key_enable+0x11/0x20
       ? start_kernel+0x23e/0x4c8
       ? secondary_startup_64+0xa5/0xb0
      
      ---[ end trace abdc99c031b8a90a ]---
      
      If the code above is moved after mm_init(), no warning is shown, as struct
      pages are initialized during handover from memblock.
      
      Use text_poke_early() in static branching until early boot IRQs are
      enabled, and from there on switch to text_poke(). Also, ensure
      text_poke() is never invoked when uninitialized memory access may
      happen, by adding a !after_bootmem assertion.
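      
      A sketch of the two pieces (names as in arch/x86/kernel/alternative.c
      and the x86 jump label code; the exact placement is illustrative):
      
        void *text_poke(void *addr, const void *opcode, size_t len)
        {
                /* struct page access is only valid after the memblock
                 * handover; assert that. */
                BUG_ON(!after_bootmem);
                ...
        }
      
        /* jump label patching: use the early variant until early boot
         * IRQs are enabled */
        void *(*poker)(void *, const void *, size_t) = text_poke;
      
        if (early_boot_irqs_disabled)
                poker = text_poke_early;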
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: steven.sistare@oracle.com
      Cc: daniel.m.jordan@oracle.com
      Cc: linux@armlinux.org.uk
      Cc: schwidefsky@de.ibm.com
      Cc: heiko.carstens@de.ibm.com
      Cc: john.stultz@linaro.org
      Cc: sboyd@codeaurora.org
      Cc: hpa@zytor.com
      Cc: douly.fnst@cn.fujitsu.com
      Cc: peterz@infradead.org
      Cc: prarit@redhat.com
      Cc: feng.tang@intel.com
      Cc: pmladek@suse.com
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: linux-s390@vger.kernel.org
      Cc: boris.ostrovsky@oracle.com
      Cc: jgross@suse.com
      Cc: pbonzini@redhat.com
      Link: https://lkml.kernel.org/r/20180719205545.16512-9-pasha.tatashin@oracle.com
    • x86/kvmclock: Switch kvmclock data to a PER_CPU variable · 95a3d445
      Committed by Thomas Gleixner
      The previous removal of the memblock dependency from kvmclock introduced
      a static data array sized 64 bytes * CONFIG_NR_CPUS. That's wasteful on
      large systems when kvmclock is not used.
      
      Replace it with:
      
       - A static page-sized array of pvclock data. It is page sized because
         the pvclock data of the boot CPU is mapped into the vDSO; otherwise
         random other data would be exposed to the vDSO.
      
       - A PER_CPU variable of pvclock data pointers. This is used to access
         the pvclock data storage on each CPU (sketched below).
      
      The setup is done in two stages:
      
       - Early boot stores the pointer to the static page for the boot CPU in
         the per cpu data.
      
       - In the preparatory stage of CPU hotplug, assign either an element of
         the static array (when the CPU number is in that range) or newly
         allocated memory, and initialize the per-cpu pointer.
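      
      The storage layout, sketched (HVC_BOOT_ARRAY_SIZE standing for the
      number of pvclock entries that fit into one page; names illustrative):
      
        /* Page sized: the boot CPU's entry is mapped into the vDSO */
        static struct pvclock_vsyscall_time_info
                hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __aligned(PAGE_SIZE);
      
        /* Per-CPU pointer to this CPU's pvclock data */
        static DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *,
                              hv_clock_per_cpu);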
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: steven.sistare@oracle.com
      Cc: daniel.m.jordan@oracle.com
      Cc: linux@armlinux.org.uk
      Cc: schwidefsky@de.ibm.com
      Cc: heiko.carstens@de.ibm.com
      Cc: john.stultz@linaro.org
      Cc: sboyd@codeaurora.org
      Cc: hpa@zytor.com
      Cc: douly.fnst@cn.fujitsu.com
      Cc: peterz@infradead.org
      Cc: prarit@redhat.com
      Cc: feng.tang@intel.com
      Cc: pmladek@suse.com
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: linux-s390@vger.kernel.org
      Cc: boris.ostrovsky@oracle.com
      Cc: jgross@suse.com
      Link: https://lkml.kernel.org/r/20180719205545.16512-8-pasha.tatashin@oracle.com
    • x86/kvmclock: Move kvmclock vsyscall param and init to kvmclock · e499a9b6
      Committed by Thomas Gleixner
      There is no point in having this in the kvm code itself and calling it
      from there. It can be called from an initcall, and the parameter is
      cleared when the hypervisor is not KVM.
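      
      The shape of the change, as a sketch (body elided; kvmclock_vsyscall is
      the parameter in question):
      
        static int __init kvm_setup_vsyscall_timeinfo(void)
        {
                /* The parameter was cleared during early boot when the
                 * hypervisor is not KVM, so this is a no-op elsewhere. */
                if (!kvmclock_vsyscall)
                        return 0;
                /* ... set up the pvclock vsyscall mapping ... */
                return 0;
        }
        early_initcall(kvm_setup_vsyscall_timeinfo);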
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: steven.sistare@oracle.com
      Cc: daniel.m.jordan@oracle.com
      Cc: linux@armlinux.org.uk
      Cc: schwidefsky@de.ibm.com
      Cc: heiko.carstens@de.ibm.com
      Cc: john.stultz@linaro.org
      Cc: sboyd@codeaurora.org
      Cc: hpa@zytor.com
      Cc: douly.fnst@cn.fujitsu.com
      Cc: peterz@infradead.org
      Cc: prarit@redhat.com
      Cc: feng.tang@intel.com
      Cc: pmladek@suse.com
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: linux-s390@vger.kernel.org
      Cc: boris.ostrovsky@oracle.com
      Cc: jgross@suse.com
      Link: https://lkml.kernel.org/r/20180719205545.16512-7-pasha.tatashin@oracle.com
    • x86/kvmclock: Mark variables __initdata and __ro_after_init · 42f8df93
      Committed by Thomas Gleixner
      The kvmclock parameter is init data and the other variables are not
      modified after init.
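      
      Sketched, the annotations look like this (variable names as in
      kvmclock.c; treat as illustrative):
      
        static int kvmclock __initdata = 1;  /* only read during init */
        static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
        static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;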
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: steven.sistare@oracle.com
      Cc: daniel.m.jordan@oracle.com
      Cc: linux@armlinux.org.uk
      Cc: schwidefsky@de.ibm.com
      Cc: heiko.carstens@de.ibm.com
      Cc: john.stultz@linaro.org
      Cc: sboyd@codeaurora.org
      Cc: hpa@zytor.com
      Cc: douly.fnst@cn.fujitsu.com
      Cc: peterz@infradead.org
      Cc: prarit@redhat.com
      Cc: feng.tang@intel.com
      Cc: pmladek@suse.com
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: linux-s390@vger.kernel.org
      Cc: boris.ostrovsky@oracle.com
      Cc: jgross@suse.com
      Link: https://lkml.kernel.org/r/20180719205545.16512-6-pasha.tatashin@oracle.com
    • x86/kvmclock: Cleanup the code · 146c394d
      Committed by Thomas Gleixner
      - Clean up the MSR write for the wall clock; see the sketch below. The
        type casts to (int) are sloppy because the wrmsr parameters are u32,
        and aside from that, wrmsrl() already provides the high/low split for
        free.
      
      - Remove the pointless get_cpu()/put_cpu() dance from various
        functions. Either they are called during early init, where the CPU is
        guaranteed to be 0, or they are already called from non-preemptible
        context where smp_processor_id() can be used safely.
      
      - Simplify the convoluted check for kvmclock in the init function.
      
      - Mark the parameter parsing function __init. No point in keeping it
        around.
      
      - Convert to pr_info()
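      
      The wall clock MSR write, before and after (a sketch; variable names
      illustrative):
      
        u64 pa = slow_virt_to_phys(&wall_clock);
      
        /* before: hand-rolled high/low split with sloppy (int) casts */
        native_write_msr(msr_kvm_wall_clock, (int)pa, (int)(pa >> 32));
      
        /* after: wrmsrl() takes the full 64-bit value and splits it */
        wrmsrl(msr_kvm_wall_clock, pa);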
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: steven.sistare@oracle.com
      Cc: daniel.m.jordan@oracle.com
      Cc: linux@armlinux.org.uk
      Cc: schwidefsky@de.ibm.com
      Cc: heiko.carstens@de.ibm.com
      Cc: john.stultz@linaro.org
      Cc: sboyd@codeaurora.org
      Cc: hpa@zytor.com
      Cc: douly.fnst@cn.fujitsu.com
      Cc: peterz@infradead.org
      Cc: prarit@redhat.com
      Cc: feng.tang@intel.com
      Cc: pmladek@suse.com
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: linux-s390@vger.kernel.org
      Cc: boris.ostrovsky@oracle.com
      Cc: jgross@suse.com
      Link: https://lkml.kernel.org/r/20180719205545.16512-5-pasha.tatashin@oracle.com
    • x86/kvmclock: Decrapify kvm_register_clock() · 7a5ddc8f
      Committed by Thomas Gleixner
      The return value is pointless because the wrmsr cannot fail if
      KVM_FEATURE_CLOCKSOURCE or KVM_FEATURE_CLOCKSOURCE2 is set.
      
      kvm_register_clock() is only called locally, so it can be made static.
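      
      After the cleanup the function reduces to something like this (a sketch;
      this_cpu_pvti() is a stand-in for the per-cpu pvclock accessor):
      
        static void kvm_register_clock(char *txt)
        {
                u64 pa = slow_virt_to_phys(this_cpu_pvti()) | 0x01ULL;
      
                /* Cannot fail: a KVM_FEATURE_CLOCKSOURCE* bit is set */
                wrmsrl(msr_kvm_system_time, pa);
                pr_info("kvm-clock: cpu %d, msr %llx, %s\n",
                        smp_processor_id(), pa, txt);
        }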
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: steven.sistare@oracle.com
      Cc: daniel.m.jordan@oracle.com
      Cc: linux@armlinux.org.uk
      Cc: schwidefsky@de.ibm.com
      Cc: heiko.carstens@de.ibm.com
      Cc: john.stultz@linaro.org
      Cc: sboyd@codeaurora.org
      Cc: hpa@zytor.com
      Cc: douly.fnst@cn.fujitsu.com
      Cc: peterz@infradead.org
      Cc: prarit@redhat.com
      Cc: feng.tang@intel.com
      Cc: pmladek@suse.com
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: linux-s390@vger.kernel.org
      Cc: boris.ostrovsky@oracle.com
      Cc: jgross@suse.com
      Link: https://lkml.kernel.org/r/20180719205545.16512-4-pasha.tatashin@oracle.com
    • x86/kvmclock: Remove page size requirement from wall_clock · 7ef363a3
      Committed by Thomas Gleixner
      There is no requirement for wall_clock data to be page aligned or page
      sized.
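      
      In a nutshell (a before/after sketch, not the verbatim diff):
      
        /* before: wall_clock occupied a whole aligned page */
        static struct pvclock_wall_clock wall_clock __aligned(PAGE_SIZE);
      
        /* after: a plain static in .bss is sufficient */
        static struct pvclock_wall_clock wall_clock;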
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: steven.sistare@oracle.com
      Cc: daniel.m.jordan@oracle.com
      Cc: linux@armlinux.org.uk
      Cc: schwidefsky@de.ibm.com
      Cc: heiko.carstens@de.ibm.com
      Cc: john.stultz@linaro.org
      Cc: sboyd@codeaurora.org
      Cc: hpa@zytor.com
      Cc: douly.fnst@cn.fujitsu.com
      Cc: peterz@infradead.org
      Cc: prarit@redhat.com
      Cc: feng.tang@intel.com
      Cc: pmladek@suse.com
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: linux-s390@vger.kernel.org
      Cc: boris.ostrovsky@oracle.com
      Cc: jgross@suse.com
      Link: https://lkml.kernel.org/r/20180719205545.16512-3-pasha.tatashin@oracle.com
    • x86/kvmclock: Remove memblock dependency · 368a540e
      Committed by Pavel Tatashin
      KVM clock is initialized later compared to other hypervisor clocks because
      it has a dependency on the memblock allocator.
      
      Bring it in line with other hypervisors by using memory from the BSS
      instead of allocating it.
      
      The benefits:
      
        - Remove ifdef from common code
        - Earlier availability of the clock
        - Remove dependency on memblock, and reduce code
      
      The downside:
      
        - Static allocation of the per-cpu data structures, sized
          NR_CPUS * 64 bytes. This will be addressed in follow-up patches;
          see the sketch below.
      
      [ tglx: Split out from larger series ]
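      
      The static replacement, sketched (macro and array names illustrative):
      
        #define HVC_SIZE (sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
      
        /* .bss storage instead of a memblock allocation */
        static u8 hv_clock_mem[PAGE_ALIGN(HVC_SIZE)] __aligned(PAGE_SIZE);
        static struct pvclock_wall_clock wall_clock;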
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: steven.sistare@oracle.com
      Cc: daniel.m.jordan@oracle.com
      Cc: linux@armlinux.org.uk
      Cc: schwidefsky@de.ibm.com
      Cc: heiko.carstens@de.ibm.com
      Cc: john.stultz@linaro.org
      Cc: sboyd@codeaurora.org
      Cc: hpa@zytor.com
      Cc: douly.fnst@cn.fujitsu.com
      Cc: peterz@infradead.org
      Cc: prarit@redhat.com
      Cc: feng.tang@intel.com
      Cc: pmladek@suse.com
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: linux-s390@vger.kernel.org
      Cc: boris.ostrovsky@oracle.com
      Cc: jgross@suse.com
      Link: https://lkml.kernel.org/r/20180719205545.16512-2-pasha.tatashin@oracle.com
  2. July 18, 2018: 2 commits
    • kvmclock: fix TSC calibration for nested guests · e10f7805
      Committed by Peng Hao
      Inside a nested guest, access to hardware can be slow enough that
      tsc_read_refs() always returns ULLONG_MAX, causing
      tsc_refine_calibration_work to be called periodically and the nested
      guest to spend a lot of time reading the ACPI timer.
      
      However, if the TSC frequency is available from the pvclock page, we can
      just set X86_FEATURE_TSC_KNOWN_FREQ and avoid the recalibration
      ('refine') operation.
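      
      Sketched, the calibration hook becomes (pvclock_tsc_khz() reads the
      frequency from the pvclock page; the exact accessor is illustrative):
      
        static unsigned long kvm_get_tsc_khz(void)
        {
                /* The frequency comes from the host, so the slow
                 * ACPI-timer based refinement can be skipped. */
                setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
                return pvclock_tsc_khz(&hv_clock[0].pvti);
        }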
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
      [Commit message rewritten. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Mark VMXArea with revision_id of physical CPU even when eVMCS enabled · 2307af1c
      Committed by Liran Alon
      When eVMCS is enabled, all VMCSs allocated for use by KVM are marked
      with the revision_id of KVM_EVMCS_VERSION instead of the revision_id
      reported by MSR_IA32_VMX_BASIC.
      
      However, even though not explicitly documented by the TLFS, the VMXArea
      passed as the VMXON argument should still be marked with the revision_id
      reported by the physical CPU; see the sketch after the trace below.
      
      This issue was found with the following setup:
      * L0 = KVM which exposes eVMCS to its L1 guest.
      * L1 = KVM which consumes the eVMCS exposed by L0.
      This setup caused the following to occur:
      1) L1 executes hardware_enable().
      2) hardware_enable() calls kvm_cpu_vmxon() to execute VMXON.
      3) L0 intercepts the L1 VMXON and executes handle_vmon(), which notes
      vmxarea->revision_id != VMCS12_REVISION and therefore fails with
      nested_vmx_failInvalid(), which sets RFLAGS.CF.
      4) L1's kvm_cpu_vmxon() doesn't check RFLAGS.CF for failure, and
      therefore hardware_enable() continues as usual.
      5) L1's hardware_enable() then calls ept_sync_global(), which executes
      INVEPT.
      6) L0 intercepts the INVEPT and executes handle_invept(), which notes
      !vmx->nested.vmxon and thus raises a #UD to L1.
      7) The raised #UD causes L1 to panic.
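      
      The fix, roughly (a sketch of the hunk in KVM's VMCS allocation path;
      here the freshly allocated per-CPU VMXON region gets the physical CPU's
      revision_id back):
      
        /* VMCSs carry KVM_EVMCS_VERSION when eVMCS is in use, but the
         * VMXON region must carry the MSR_IA32_VMX_BASIC revision_id. */
        if (static_branch_unlikely(&enable_evmcs))
                vmcs->revision_id = vmcs_config.revision_id;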
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Cc: stable@vger.kernel.org
      Fixes: 773e8a04
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. July 15, 2018: 6 commits
  4. July 13, 2018: 1 commit
  5. July 12, 2018: 1 commit
  6. July 11, 2018: 1 commit
    • efi/x86: Fix mixed mode reboot loop by removing pointless call to PciIo->Attributes() · e2967018
      Committed by Ard Biesheuvel
      Hans de Goede reported that his mixed EFI mode Bay Trail tablet
      would not boot at all any more, but enter a reboot loop without
      any logs printed by the kernel.
      
      Unbreak 64-bit Linux/x86 on 32-bit UEFI:
      
      When it was first introduced, the EFI stub code that copies the
      contents of PCI option ROMs originally only intended to do so if
      the EFI_PCI_IO_ATTRIBUTE_EMBEDDED_ROM attribute was *not* set.
      
      The reason was that the UEFI spec permits PCI option ROM images
      to be provided by the platform directly, rather than via the ROM
      BAR, and in this case, the OS can only access them at runtime if
      they are preserved at boot time by copying them from the areas
      described by PciIo->RomImage and PciIo->RomSize.
      
      However, it implemented this check erroneously, as can be seen in
      commit:
      
        dd5fc854 ("EFI: Stash ROMs if they're not in the PCI BAR")
      
      which introduced:
      
          if (!attributes & EFI_PCI_IO_ATTRIBUTE_EMBEDDED_ROM)
                  continue;
      
      and given that the numeric value of EFI_PCI_IO_ATTRIBUTE_EMBEDDED_ROM
      is 0x4000, this condition never becomes true, and so the option ROMs
      were copied unconditionally.
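      
      Spelled out (illustrative values only):
      
        /* EFI_PCI_IO_ATTRIBUTE_EMBEDDED_ROM == 0x4000 */
        if (!attributes & EFI_PCI_IO_ATTRIBUTE_EMBEDDED_ROM) /* (0 or 1) & 0x4000 == 0 */
                continue;                                    /* never taken */
      
        /* the intended precedence would have been: */
        if (!(attributes & EFI_PCI_IO_ATTRIBUTE_EMBEDDED_ROM))
                continue;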
      
      This was spotted and 'fixed' by commit:
      
        886d751a ("x86, efi: correct precedence of operators in setup_efi_pci")
      
      which, however, inadvertently inverted the logic at the same time,
      defeating the purpose of the code, since it then only preserved option
      ROM images that can be read from the ROM BAR as well.
      
      Unsurprisingly, this broke some systems, and so the check was removed
      entirely in the following commit:
      
        73970188 ("x86, efi: remove attribute check from setup_efi_pci")
      
      It is debatable whether this check should have been included in the
      first place, since the option ROM image provided to the UEFI driver by
      the firmware may be different from the one that is actually present in
      the card's flash ROM, and so whatever PciIo->RomImage points at should
      be preferred regardless of whether the attribute is set.
      
      As this was the only use of the attributes field, we can remove the call
      to PciIo->Attributes() entirely, which is especially nice because its
      prototype involves by-value uint64_t arguments, which the EFI mixed mode
      has trouble dealing with.
      
      Any mixed mode system with PCI is likely to be affected.
      Tested-by: Wilfried Klaebe <linux-kernel@lebenslange-mailadresse.de>
      Tested-by: Hans de Goede <hdegoede@redhat.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180711090235.9327-2-ard.biesheuvel@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. July 8, 2018: 1 commit
  8. July 6, 2018: 1 commit
  9. July 3, 2018: 11 commits
  10. July 1, 2018: 1 commit
  11. June 29, 2018: 1 commit
    • x86/e820: put !E820_TYPE_RAM regions into memblock.reserved · 124049de
      Committed by Naoya Horiguchi
      There is a kernel panic that is triggered when reading /proc/kpageflags
      on a kernel booted with the kernel parameter 'memmap=nn[KMG]!ss[KMG]':
      
        BUG: unable to handle kernel paging request at fffffffffffffffe
        PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
        Oops: 0000 [#1] SMP PTI
        CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
        RIP: 0010:stable_page_flags+0x27/0x3c0
        Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
        RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
        RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
        RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
        R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
        R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
        FS:  00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
        Call Trace:
         kpageflags_read+0xc7/0x120
         proc_reg_read+0x3c/0x60
         __vfs_read+0x36/0x170
         vfs_read+0x89/0x130
         ksys_pread64+0x71/0x90
         do_syscall_64+0x5b/0x160
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7efc42e75e23
        Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
      
      According to a kernel bisection, this problem became visible due to commit
      f7f99100 ("mm: stop zeroing memory during allocation in vmemmap"),
      which changed how struct pages are initialized.
      
      Memblock layout affects the pfn ranges covered by node/zone.  Consider
      that we have a VM with 2 NUMA nodes and each node has 4GB memory, and
      the default (no memmap= given) memblock layout is like below:
      
        MEMBLOCK configuration:
         memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x4
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
         memory[0x3]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      If you give memmap=1G!4G (so it just covers memory[0x2]),
      the range [0x100000000-0x13fffffff] is gone:
      
        MEMBLOCK configuration:
         memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x3
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      This shrinks node 0's pfn range, because that range is calculated from
      the address range of memblock.memory. So some of the struct pages in the
      gap range are left uninitialized.
      
      We have a function zero_resv_unavail() which zeroes the struct pages
      within the reserved unavailable range (i.e. memblock.reserved &&
      !memblock.memory). This patch utilizes it to cover all unavailable
      ranges by putting them into memblock.reserved; see the sketch below.
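      
      A sketch of the idea in e820__memblock_setup() (the actual hunk differs
      in detail):
      
        for (i = 0; i < e820_table->nr_entries; i++) {
                struct e820_entry *entry = &e820_table->entries[i];
      
                if (entry->type == E820_TYPE_RAM ||
                    entry->type == E820_TYPE_RESERVED_KERN)
                        memblock_add(entry->addr, entry->size);
                else
                        /* !E820_TYPE_RAM ranges land in memblock.reserved,
                         * so zero_resv_unavail() initializes their pages */
                        memblock_reserve(entry->addr, entry->size);
        }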
      
      Link: http://lkml.kernel.org/r/20180615072947.GB23273@hori1.linux.bs1.fc.nec.co.jp
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Tested-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: "Herton R. Krzesinski" <herton@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. June 27, 2018: 3 commits