1. 09 Jan 2020 (6 commits)
  2. 05 Jan 2020 (1 commit)
    • mm/memory_hotplug: shrink zones when offlining memory · feee6b29
      Authored by David Hildenbrand
      We currently try to shrink a single zone when removing memory.  We use
      the zone of the first page of the memory we are removing.  If that
      memmap was never initialized (e.g., memory was never onlined), we will
      read garbage and can trigger kernel BUGs (due to a stale pointer):
      
          BUG: unable to handle page fault for address: 000000000000353d
          #PF: supervisor write access in kernel mode
          #PF: error_code(0x0002) - not-present page
          PGD 0 P4D 0
          Oops: 0002 [#1] SMP PTI
          CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
          Workqueue: kacpi_hotplug acpi_hotplug_work_fn
          RIP: 0010:clear_zone_contiguous+0x5/0x10
          Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
          RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
          RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
          RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
          RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
          R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
          R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
          FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           __remove_pages+0x4b/0x640
           arch_remove_memory+0x63/0x8d
           try_remove_memory+0xdb/0x130
           __remove_memory+0xa/0x11
           acpi_memory_device_remove+0x70/0x100
           acpi_bus_trim+0x55/0x90
           acpi_device_hotplug+0x227/0x3a0
           acpi_hotplug_work_fn+0x1a/0x30
           process_one_work+0x221/0x550
           worker_thread+0x50/0x3b0
           kthread+0x105/0x140
           ret_from_fork+0x3a/0x50
          Modules linked in:
          CR2: 000000000000353d
      
      Instead, shrink the zones when offlining memory or when onlining failed.
      Introduce and use remove_pfn_range_from_zone() for that.  We now
      properly shrink the zones, even if we have DIMMs whereby
      
       - Some memory blocks fall into no zone (never onlined)
      
       - Some memory blocks fall into multiple zones (offlined+re-onlined)
      
       - Multiple memory blocks that fall into different zones
      
      Drop the zone parameter (with a potential dubious value) from
      __remove_pages() and __remove_section().
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      feee6b29
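
      [ Editor's sketch, not part of the patch: a simplified userspace C model of
        the zone-shrinking idea. struct zone_model and
        remove_pfn_range_from_zone_model() are illustrative stand-ins, not the
        kernel's structures; the point is that only zones which actually intersect
        the removed pfn range are touched, instead of guessing a zone from a
        possibly uninitialized memmap. ]

        #include <stdbool.h>
        #include <stdio.h>

        /* Illustrative stand-in for a memory zone's span (not the kernel struct). */
        struct zone_model {
            const char *name;
            unsigned long start_pfn;
            unsigned long nr_pages;     /* spanned pages; 0 means "empty zone" */
        };

        static bool intersects(const struct zone_model *z,
                               unsigned long start, unsigned long nr)
        {
            return z->nr_pages &&
                   start < z->start_pfn + z->nr_pages &&
                   z->start_pfn < start + nr;
        }

        /* Shrink a zone's span when a pfn range at one of its edges goes away. */
        static void remove_pfn_range_from_zone_model(struct zone_model *z,
                                                     unsigned long start,
                                                     unsigned long nr)
        {
            if (!intersects(z, start, nr))
                return;                         /* never touch unrelated zones */

            if (start <= z->start_pfn) {        /* range covers the lower edge */
                unsigned long cut = start + nr - z->start_pfn;
                if (cut >= z->nr_pages) {
                    z->nr_pages = 0;            /* whole zone removed */
                    return;
                }
                z->start_pfn += cut;
                z->nr_pages -= cut;
            } else if (start + nr >= z->start_pfn + z->nr_pages) {
                z->nr_pages = start - z->start_pfn;  /* upper edge removed */
            }
            /* a hole in the middle keeps the span; the kernel handles that separately */
        }

        int main(void)
        {
            struct zone_model zones[] = {
                { "Normal",  0x100000, 0x80000 },
                { "Movable", 0x180000, 0x40000 },
            };

            /* Offline the last 0x20000 pages: only the Movable zone shrinks. */
            for (int i = 0; i < 2; i++)
                remove_pfn_range_from_zone_model(&zones[i], 0x1a0000, 0x20000);

            for (int i = 0; i < 2; i++)
                printf("%s: start=%#lx pages=%#lx\n",
                       zones[i].name, zones[i].start_pfn, zones[i].nr_pages);
            return 0;
        }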
  3. 19 Dec 2019 (2 commits)
  4. 17 Dec 2019 (6 commits)
    • perf/x86/intel: Fix PT PMI handling · 92ca7da4
      Authored by Alexander Shishkin
      Commit:
      
        ccbebba4 ("perf/x86/intel/pt: Bypass PT vs. LBR exclusivity if the core supports it")
      
      skips the PT/LBR exclusivity check on CPUs where PT and LBRs coexist, but
      also inadvertently skips the active_events bump for PT in that case, which
      is a bug. If there aren't any hardware events at the same time as PT, the
      PMI handler will ignore PT PMIs, as active_events reads zero in that case,
      resulting in the "Uhhuh" spurious NMI warning and PT data loss.
      
      Fix this by always increasing active_events for PT events.
      
      Fixes: ccbebba4 ("perf/x86/intel/pt: Bypass PT vs. LBR exclusivity if the core supports it")
      Reported-by: Vitaly Slobodskoy <vitaly.slobodskoy@intel.com>
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Link: https://lkml.kernel.org/r/20191210105101.77210-1-alexander.shishkin@linux.intel.com
      92ca7da4
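
      [ Editor's sketch, not the actual PMU code: a standalone model of the
        counting bug. active_events, event_add()/event_del() and pmi_handler()
        are invented stand-ins; the point is that the handler bails out early
        when the counter reads zero, so every event type that can raise a PMI,
        including PT, must bump it. ]

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        /* Simplified stand-in for the PMU's "any events active?" refcount. */
        static atomic_int active_events;

        enum event_type { EVT_HW_COUNTER, EVT_PT };

        /* Before the fix, PT events on LBR-coexisting CPUs skipped this bump. */
        static void event_add(enum event_type type)
        {
            (void)type;
            atomic_fetch_add(&active_events, 1);    /* must happen for PT too */
        }

        static void event_del(enum event_type type)
        {
            (void)type;
            atomic_fetch_sub(&active_events, 1);
        }

        /* Model of the PMI handler's early-out. */
        static bool pmi_handler(void)
        {
            if (atomic_load(&active_events) == 0) {
                puts("Uhhuh. NMI received for unknown reason (spurious).");
                return false;       /* PMI ignored -> PT data loss */
            }
            puts("PMI handled.");
            return true;
        }

        int main(void)
        {
            /* PT-only session: with the bump in place the PMI is handled. */
            event_add(EVT_PT);
            pmi_handler();
            event_del(EVT_PT);
            pmi_handler();          /* no events left -> spurious */
            return 0;
        }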
    • perf/x86/intel/bts: Fix the use of page_private() · ff61541c
      Authored by Alexander Shishkin
      Commit
      
        8062382c ("perf/x86/intel/bts: Add BTS PMU driver")
      
      brought in a warning with the BTS buffer initialization that is
      easily tripped (assuming KPTI is disabled), instantly throwing:
      
      > ------------[ cut here ]------------
      > WARNING: CPU: 2 PID: 326 at arch/x86/events/intel/bts.c:86 bts_buffer_setup_aux+0x117/0x3d0
      > Modules linked in:
      > CPU: 2 PID: 326 Comm: perf Not tainted 5.4.0-rc8-00291-gceb9e773 #904
      > RIP: 0010:bts_buffer_setup_aux+0x117/0x3d0
      > Call Trace:
      >  rb_alloc_aux+0x339/0x550
      >  perf_mmap+0x607/0xc70
      >  mmap_region+0x76b/0xbd0
      ...
      
      It appears to assume (for lost raisins) that PagePrivate() is set,
      while later it actually tests for PagePrivate() before using
      page_private().
      
      Make it consistent and always check PagePrivate() before using
      page_private().
      
      Fixes: 8062382c ("perf/x86/intel/bts: Add BTS PMU driver")
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Link: https://lkml.kernel.org/r/20191205142853.28894-2-alexander.shishkin@linux.intel.com
      ff61541c
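
      [ Editor's sketch, not kernel code: a minimal model of the consistency
        rule with stand-in fields instead of struct page. Only read the private
        field when the "private" flag says it is valid. ]

        #include <stdbool.h>
        #include <stdio.h>

        /* Stand-in for struct page: private_val is only meaningful when has_private is set. */
        struct page_model {
            bool has_private;           /* models PagePrivate() */
            unsigned long private_val;  /* models page_private(): order of the allocation */
        };

        static unsigned long buf_pages(const struct page_model *page)
        {
            /* Check the flag first; otherwise private_val is garbage. */
            if (page->has_private)
                return 1UL << page->private_val;    /* high-order chunk */
            return 1;                               /* single page */
        }

        int main(void)
        {
            struct page_model head = { .has_private = true,  .private_val = 3 };
            struct page_model tail = { .has_private = false, .private_val = 0xdead };

            printf("head chunk: %lu pages\n", buf_pages(&head));
            printf("tail chunk: %lu pages\n", buf_pages(&tail));
            return 0;
        }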
    • perf/x86: Fix potential out-of-bounds access · 1e69a0ef
      Authored by Peter Zijlstra
      UBSAN reported out-of-bounds accesses for x86_pmu.event_map(); its
      argument should be < x86_pmu.max_events. Make sure all users observe
      this constraint.
      Reported-by: Meelis Roos <mroos@linux.ee>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Meelis Roos <mroos@linux.ee>
      1e69a0ef
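
      [ Editor's illustration, standalone model only: the table, its size and
        map_event() are invented; the point is that the index is validated
        before it ever reaches the table. ]

        #include <stdio.h>

        #define MAX_EVENTS 8

        /* Stand-in for x86_pmu.event_map(): a fixed lookup table. */
        static const unsigned long event_map_table[MAX_EVENTS] = {
            0x003c, 0x00c0, 0x4f2e, 0x412e, 0x00c4, 0x00c5, 0x01c4, 0x01c5,
        };

        /* Reject out-of-range indices before indexing the table. */
        static long map_event(unsigned int idx)
        {
            if (idx >= MAX_EVENTS)
                return -1;              /* caller sees an error, not OOB data */
            return (long)event_map_table[idx];
        }

        int main(void)
        {
            printf("event 2 -> %#lx\n", (unsigned long)map_event(2));
            printf("event 99 -> %ld\n", map_event(99));     /* rejected */
            return 0;
        }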
    • x86/mce: Fix possibly incorrect severity calculation on AMD · a3a57dda
      Authored by Jan H. Schönherr
      The function mce_severity_amd_smca() requires m->bank to be initialized
      for correct operation. Fix the one case, where mce_severity() is called
      without doing so.
      
      Fixes: 6bda529e ("x86/mce: Grade uncorrected errors for SMCA-enabled systems")
      Fixes: d28af26f ("x86/MCE: Initialize mce.bank in the case of a fatal error in mce_no_way_out()")
      Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Tony Luck <tony.luck@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Cc: <stable@vger.kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
      Link: https://lkml.kernel.org/r/20191210000733.17979-4-jschoenh@amazon.de
      a3a57dda
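
      [ Editor's sketch, not the actual fix: a standalone model of why severity
        grading needs the bank index set first. mce_model, severity_model() and
        the per-bank table are invented for illustration. ]

        #include <stdio.h>

        #define NR_BANKS_MODEL 4

        /* Stand-in for per-bank information that severity grading consults. */
        static const int bank_is_scalable[NR_BANKS_MODEL] = { 1, 0, 1, 1 };

        struct mce_model {
            unsigned long status;
            unsigned int bank;      /* must be set before grading */
        };

        static int severity_model(const struct mce_model *m)
        {
            /* An uninitialized bank index reads the wrong (or out-of-range) entry. */
            if (m->bank >= NR_BANKS_MODEL)
                return -1;
            return bank_is_scalable[m->bank] ? 2 : 1;
        }

        int main(void)
        {
            struct mce_model m = { .status = 0xb000000000000000UL, .bank = 0 };

            /* The fix: record which bank the error came from before grading it. */
            m.bank = 2;
            printf("severity = %d\n", severity_model(&m));
            return 0;
        }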
    • x86/MCE/AMD: Allow Reserved types to be overwritten in smca_banks[] · 966af209
      Authored by Yazen Ghannam
      Each logical CPU in Scalable MCA systems controls a unique set of MCA
      banks in the system. These banks are not shared between CPUs. The bank
      types and ordering will be the same across CPUs on currently available
      systems.
      
      However, some CPUs may see a bank as Reserved/Read-as-Zero (RAZ) while
      other CPUs do not. In this case, the bank seen as Reserved on one CPU is
      assumed to be the same type as the bank seen as a known type on another
      CPU.
      
      In general, this occurs when the hardware represented by the MCA bank
      is disabled, e.g. disabled memory controllers on certain models, etc.
      The MCA bank is disabled in the hardware, so there is no possibility of
      getting an MCA/MCE from it even if it is assumed to have a known type.
      
      For example:
      
      Full system:
      	Bank  |  Type seen on CPU0  |  Type seen on CPU1
      	------------------------------------------------
      	 0    |         LS          |          LS
      	 1    |         UMC         |          UMC
      	 2    |         CS          |          CS
      
      System with hardware disabled:
      	Bank  |  Type seen on CPU0  |  Type seen on CPU1
      	------------------------------------------------
      	 0    |         LS          |          LS
      	 1    |         UMC         |          RAZ
      	 2    |         CS          |          CS
      
      For this reason, there is a single, global struct smca_banks[] that is
      initialized at boot time. This array is initialized on each CPU as it
      comes online. However, the array will not be updated if an entry already
      exists.
      
      This works as expected when the first CPU (usually CPU0) has all
      possible MCA banks enabled. But if the first CPU has a subset, then it
      will save a "Reserved" type in smca_banks[]. Successive CPUs will then
      not be able to update smca_banks[] even if they encounter a known bank
      type.
      
      This may result in unexpected behavior. Depending on the system
      configuration, a user may observe issues enumerating the MCA
      thresholding sysfs interface. The issues may be as trivial as sysfs
      entries not being available, or as severe as system hangs.
      
      For example:
      
      	Bank  |  Type seen on CPU0  |  Type seen on CPU1
      	------------------------------------------------
      	 0    |         LS          |          LS
      	 1    |         RAZ         |          UMC
      	 2    |         CS          |          CS
      
      Extend the smca_banks[] entry check to return if the entry is a
      non-reserved type. Otherwise, continue so that CPUs that encounter a
      known bank type can update smca_banks[].
      
      Fixes: 68627a69 ("x86/mce/AMD, EDAC/mce_amd: Enumerate Reserved SMCA bank type")
      Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Cc: <stable@vger.kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20191121141508.141273-1-Yazen.Ghannam@amd.com
      966af209
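
      [ Editor's sketch, standalone model only: the array, the enum and
        smca_configure_bank() are illustrative stand-ins. It shows the extended
        check described above: an existing entry only blocks an update when it
        already holds a known, non-reserved type. ]

        #include <stdio.h>

        enum bank_type { BANK_UNSET = 0, BANK_RESERVED, BANK_LS, BANK_UMC, BANK_CS };

        #define NR_BANKS 3

        /* Global table shared by all CPUs, filled in as each CPU comes online. */
        static enum bank_type smca_banks_model[NR_BANKS];

        static const char *name(enum bank_type t)
        {
            static const char *n[] = { "unset", "Reserved", "LS", "UMC", "CS" };
            return n[t];
        }

        /* Record what this CPU sees for one bank. */
        static void smca_configure_bank(int bank, enum bank_type seen)
        {
            /*
             * Old behavior: return if the entry is set at all, so a Reserved
             * entry saved by the first CPU could never be upgraded.
             * Fixed behavior: only a known, non-reserved type is final.
             */
            if (smca_banks_model[bank] != BANK_UNSET &&
                smca_banks_model[bank] != BANK_RESERVED)
                return;

            smca_banks_model[bank] = seen;
        }

        int main(void)
        {
            /* CPU0 sees bank 1 as Reserved (its memory controller is disabled). */
            enum bank_type cpu0[NR_BANKS] = { BANK_LS, BANK_RESERVED, BANK_CS };
            /* CPU1 sees the real type. */
            enum bank_type cpu1[NR_BANKS] = { BANK_LS, BANK_UMC, BANK_CS };

            for (int b = 0; b < NR_BANKS; b++) smca_configure_bank(b, cpu0[b]);
            for (int b = 0; b < NR_BANKS; b++) smca_configure_bank(b, cpu1[b]);

            for (int b = 0; b < NR_BANKS; b++)
                printf("bank %d: %s\n", b, name(smca_banks_model[b]));
            return 0;
        }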
    • x86/MCE/AMD: Do not use rdmsr_safe_on_cpu() in smca_configure() · 246ff09f
      Authored by Konstantin Khlebnikov
      ... because interrupts are disabled that early and sending IPIs can
      deadlock:
      
        BUG: sleeping function called from invalid context at kernel/sched/completion.c:99
        in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
        no locks held by swapper/1/0.
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffff8106dda9>] copy_process+0x8b9/0x1ca0
        softirqs last  enabled at (0): [<ffffffff8106dda9>] copy_process+0x8b9/0x1ca0
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        Preemption disabled at:
        [<ffffffff8104703b>] start_secondary+0x3b/0x190
        CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.5.0-rc2+ #1
        Hardware name: GIGABYTE MZ01-CE1-00/MZ01-CE1-00, BIOS F02 08/29/2018
        Call Trace:
         dump_stack
         ___might_sleep.cold.92
         wait_for_completion
         ? generic_exec_single
         rdmsr_safe_on_cpu
         ? wrmsr_on_cpus
         mce_amd_feature_init
         mcheck_cpu_init
         identify_cpu
         identify_secondary_cpu
         smp_store_cpu_info
         start_secondary
         secondary_startup_64
      
      The function smca_configure() is called only on the current CPU anyway,
      therefore replace rdmsr_safe_on_cpu() with atomic rdmsr_safe() and avoid
      the IPI.
      
       [ bp: Update commit message. ]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Cc: <stable@vger.kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/157252708836.3876.4604398213417262402.stgit@buzz
      246ff09f
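
      [ Editor's sketch: a userspace model of why the cross-CPU variant is the
        wrong tool here. read_msr_local() and read_msr_on_cpu() are stand-ins
        for rdmsr_safe() and rdmsr_safe_on_cpu(); the IPI deadlock is reduced
        to an error message. ]

        #include <stdbool.h>
        #include <stdio.h>

        static bool irqs_disabled = true;   /* early CPU bring-up: interrupts are off */

        /* Model of rdmsr_safe(): reads on the current CPU, no IPI needed. */
        static int read_msr_local(unsigned int msr, unsigned long *val)
        {
            *val = 0xdeadbeef ^ msr;        /* fake MSR content */
            return 0;
        }

        /* Model of rdmsr_safe_on_cpu(): sends an IPI and sleeps on a completion. */
        static int read_msr_on_cpu(int cpu, unsigned int msr, unsigned long *val)
        {
            (void)cpu;
            if (irqs_disabled) {
                printf("BUG: sleeping function called from invalid context\n");
                return -1;                  /* in the kernel this can deadlock */
            }
            return read_msr_local(msr, val);
        }

        int main(void)
        {
            unsigned long v;

            /* smca_configure() runs on the CPU it configures, so the local read suffices. */
            if (read_msr_on_cpu(1, 0xc0002004, &v))
                puts("cross-CPU read rejected during early bring-up");

            read_msr_local(0xc0002004, &v);
            printf("local read ok: %#lx\n", v);
            return 0;
        }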
  5. 14 Dec 2019 (1 commit)
  6. 11 Dec 2019 (1 commit)
  7. 10 Dec 2019 (1 commit)
  8. 05 Dec 2019 (2 commits)
    • arch: sembuf.h: make uapi asm/sembuf.h self-contained · 0fb9dc28
      Authored by Masahiro Yamada
      Userspace cannot compile <asm/sembuf.h> due to some missing type
      definitions.  For example, building it for x86 fails as follows:
      
          CC      usr/include/asm/sembuf.h.s
        In file included from <command-line>:32:0:
        usr/include/asm/sembuf.h:17:20: error: field `sem_perm' has incomplete type
          struct ipc64_perm sem_perm; /* permissions .. see ipc.h */
                            ^~~~~~~~
        usr/include/asm/sembuf.h:24:2: error: unknown type name `__kernel_time_t'
          __kernel_time_t sem_otime; /* last semop time */
          ^~~~~~~~~~~~~~~
        usr/include/asm/sembuf.h:25:2: error: unknown type name `__kernel_ulong_t'
          __kernel_ulong_t __unused1;
          ^~~~~~~~~~~~~~~~
        usr/include/asm/sembuf.h:26:2: error: unknown type name `__kernel_time_t'
          __kernel_time_t sem_ctime; /* last change time */
          ^~~~~~~~~~~~~~~
        usr/include/asm/sembuf.h:27:2: error: unknown type name `__kernel_ulong_t'
          __kernel_ulong_t __unused2;
          ^~~~~~~~~~~~~~~~
        usr/include/asm/sembuf.h:29:2: error: unknown type name `__kernel_ulong_t'
          __kernel_ulong_t sem_nsems; /* no. of semaphores in array */
          ^~~~~~~~~~~~~~~~
        usr/include/asm/sembuf.h:30:2: error: unknown type name `__kernel_ulong_t'
          __kernel_ulong_t __unused3;
          ^~~~~~~~~~~~~~~~
        usr/include/asm/sembuf.h:31:2: error: unknown type name `__kernel_ulong_t'
          __kernel_ulong_t __unused4;
          ^~~~~~~~~~~~~~~~
      
      It is just a matter of a missing include directive.
      
      Include <asm/ipcbuf.h> to make it self-contained, and add it to
      the compile-test coverage.
      
      Link: http://lkml.kernel.org/r/20191030063855.9989-3-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0fb9dc28
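
      [ Editor's example: a minimal check of the property the patch restores.
        Whether this actually compiles depends on the kernel headers installed
        under /usr/include on the build machine. ]

        /* check_sembuf.c: the uapi header must be usable with no other includes before it. */
        #include <asm/sembuf.h>

        int main(void)
        {
            /* Merely referencing the type proves the header was self-contained. */
            return sizeof(struct semid64_ds) ? 0 : 1;
        }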
    • arch: msgbuf.h: make uapi asm/msgbuf.h self-contained · 9ef0e004
      Authored by Masahiro Yamada
      Userspace cannot compile <asm/msgbuf.h> due to some missing type
      definitions.  For example, building it for x86 fails as follows:
      
          CC      usr/include/asm/msgbuf.h.s
        In file included from usr/include/asm/msgbuf.h:6:0,
                         from <command-line>:32:
        usr/include/asm-generic/msgbuf.h:25:20: error: field `msg_perm' has incomplete type
          struct ipc64_perm msg_perm;
                            ^~~~~~~~
        usr/include/asm-generic/msgbuf.h:27:2: error: unknown type name `__kernel_time_t'
          __kernel_time_t msg_stime; /* last msgsnd time */
          ^~~~~~~~~~~~~~~
        usr/include/asm-generic/msgbuf.h:28:2: error: unknown type name `__kernel_time_t'
          __kernel_time_t msg_rtime; /* last msgrcv time */
          ^~~~~~~~~~~~~~~
        usr/include/asm-generic/msgbuf.h:29:2: error: unknown type name `__kernel_time_t'
          __kernel_time_t msg_ctime; /* last change time */
          ^~~~~~~~~~~~~~~
        usr/include/asm-generic/msgbuf.h:41:2: error: unknown type name `__kernel_pid_t'
          __kernel_pid_t msg_lspid; /* pid of last msgsnd */
          ^~~~~~~~~~~~~~
        usr/include/asm-generic/msgbuf.h:42:2: error: unknown type name `__kernel_pid_t'
          __kernel_pid_t msg_lrpid; /* last receive pid */
          ^~~~~~~~~~~~~~
      
      It is just a matter of a missing include directive.
      
      Include <asm/ipcbuf.h> to make it self-contained, and add it to
      the compile-test coverage.
      
      Link: http://lkml.kernel.org/r/20191030063855.9989-2-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ef0e004
  9. 04 Dec 2019 (3 commits)
    • kvm: vmx: Stop wasting a page for guest_msrs · 7d73710d
      Authored by Jim Mattson
      We will never need more guest_msrs than there are indices in
      vmx_msr_index. Thus, at present, the guest_msrs array will not exceed
      168 bytes.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7d73710d
    • KVM: x86: fix out-of-bounds write in KVM_GET_EMULATED_CPUID (CVE-2019-19332) · 433f4ba1
      Authored by Paolo Bonzini
      The bounds check was present in KVM_GET_SUPPORTED_CPUID but not
      KVM_GET_EMULATED_CPUID.
      
      Reported-by: syzbot+e3f4897236c4eeb8af4f@syzkaller.appspotmail.com
      Fixes: 84cffe49 ("kvm: Emulate MOVBE", 2013-10-29)
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      433f4ba1
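
      [ Editor's sketch, out-of-tree model only: the entry struct, the limit and
        get_emulated_cpuid_model() are invented names. It illustrates the
        missing check: the entry count requested by userspace must be validated
        before the kernel writes into its fixed-size buffer. ]

        #include <stdio.h>
        #include <string.h>

        #define MAX_CPUID_ENTRIES_MODEL 4       /* illustrative limit */

        struct cpuid_entry_model { unsigned int function, eax, ebx, ecx, edx; };

        /* Fill up to *nent entries; reject requests larger than the buffer. */
        static int get_emulated_cpuid_model(struct cpuid_entry_model *entries,
                                            unsigned int *nent)
        {
            static const struct cpuid_entry_model known[] = {
                { 0x0, 0xd, 0, 0, 0 },
                { 0x1, 0, 0, 0, 0 },
            };
            unsigned int count = sizeof(known) / sizeof(known[0]);

            if (*nent > MAX_CPUID_ENTRIES_MODEL)    /* the missing bounds check */
                return -1;
            if (*nent < count)                      /* buffer too small */
                return -1;

            memcpy(entries, known, count * sizeof(known[0]));
            *nent = count;
            return 0;
        }

        int main(void)
        {
            struct cpuid_entry_model buf[MAX_CPUID_ENTRIES_MODEL];
            unsigned int nent = 1u << 30;           /* hostile userspace value */

            if (get_emulated_cpuid_model(buf, &nent))
                puts("oversized request rejected before any write");
            return 0;
        }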
    • x86/efi: Update e820 with reserved EFI boot services data to fix kexec breakage · af164898
      Authored by Dave Young
      Michael Weiser reported that he got this error during a kexec rebooting:
      
        esrt: Unsupported ESRT version 2904149718861218184.
      
      The ESRT memory stays in EFI boot services data, and it was reserved
      in kernel via efi_mem_reserve().  The initial purpose of the reservation
      is to reuse the EFI boot services data across kexec reboot. For example
      the BGRT image data and some ESRT memory like Michael reported.
      
      But although the memory is reserved, it is not marked as such in the x86 E820
      table, and kexec_file_load() iterates over system RAM in the IO resource list
      to find places for the kernel, initramfs and other data. In Michael's case the
      kexec-loaded initramfs overwrote the ESRT memory and then the failure happened.
      
      Since kexec_file_load() depends on the E820 table being updated, just fix this
      by updating the reserved EFI boot services memory as reserved type in E820.
      
      Originally, any memory descriptors with the EFI_MEMORY_RUNTIME attribute are
      bypassed in the reservation code path because they are assumed to be reserved.

      But the reservation is still needed across multiple kexec reboots, and that is
      the only case in which this path is reached, so just drop that code chunk; then
      everything works without side effects.
      
      On my machine the ESRT memory sits in an EFI runtime data range, so it does
      not trigger the problem, but I successfully tested with BGRT instead.
      Both kexec_load() and kexec_file_load() work, and kdump works as well.
      
      [ mingo: Edited the changelog. ]
      Reported-by: Michael Weiser <michael@weiser.dinsnail.net>
      Tested-by: Michael Weiser <michael@weiser.dinsnail.net>
      Signed-off-by: Dave Young <dyoung@redhat.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kexec@lists.infradead.org
      Cc: linux-efi@vger.kernel.org
      Link: https://lkml.kernel.org/r/20191204075233.GA10520@dhcp-128-65.nay.redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      af164898
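
      [ Editor's sketch, standalone model only: the real kernel uses its own e820
        helpers; range_model, map[] and mark_reserved() are stand-ins. It shows
        the idea: ranges kept via efi_mem_reserve() also get marked reserved in
        the RAM map the kexec loader consults, so the new initramfs is never
        placed on top of them. ]

        #include <stdio.h>

        enum range_type { TYPE_RAM, TYPE_RESERVED };

        struct range_model {
            unsigned long long start, end;      /* [start, end) */
            enum range_type type;
        };

        #define NR_RANGES 4

        /* Tiny stand-in for the e820-style map kexec_file_load() walks for free RAM. */
        static struct range_model map[NR_RANGES] = {
            { 0x00100000, 0x40000000, TYPE_RAM },
            { 0x40000000, 0x40010000, TYPE_RAM },       /* holds ESRT/BGRT data */
            { 0x40010000, 0x80000000, TYPE_RAM },
            { 0x80000000, 0x80100000, TYPE_RESERVED },
        };

        /* Flip every map entry fully inside [start, end) to reserved. */
        static void mark_reserved(unsigned long long start, unsigned long long end)
        {
            for (int i = 0; i < NR_RANGES; i++)
                if (map[i].start >= start && map[i].end <= end)
                    map[i].type = TYPE_RESERVED;
        }

        int main(void)
        {
            /* EFI boot services data that efi_mem_reserve() kept for the next kernel. */
            mark_reserved(0x40000000, 0x40010000);

            for (int i = 0; i < NR_RANGES; i++)
                printf("%#llx-%#llx %s\n", map[i].start, map[i].end,
                       map[i].type == TYPE_RAM ? "usable" : "reserved");
            return 0;
        }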
  10. 02 Dec 2019 (2 commits)
    • x86/kasan: support KASAN_VMALLOC · 0609ae01
      Authored by Daniel Axtens
      In the case where KASAN directly allocates memory to back vmalloc space,
      don't map the early shadow page over it.
      
      We prepopulate pgds/p4ds for the range that would otherwise be empty.
      This is required to get it synced to hardware on boot, allowing the
      lower levels of the page tables to be filled dynamically.
      
      Link: http://lkml.kernel.org/r/20191031093909.9228-5-dja@axtens.net
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Acked-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0609ae01
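
      [ Editor's sketch, not kernel page-table code: a userspace model of why the
        top-level entries must be prepopulated at boot. All names are invented.
        A context that copies the top level only once will still see later
        dynamic fills, provided the slot already pointed at a lower-level table. ]

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        #define TOP_ENTRIES 4

        /* Model of a two-level table: top[] points at dynamically filled leaf tables. */
        static long *kernel_top[TOP_ENTRIES];
        static long *other_top[TOP_ENTRIES];    /* secondary context's boot-time copy */

        static void prepopulate(int idx)
        {
            /* Allocate an (empty) lower-level table now so the slot is non-NULL. */
            kernel_top[idx] = calloc(16, sizeof(long));
        }

        int main(void)
        {
            prepopulate(2);                                     /* covers the vmalloc shadow */
            memcpy(other_top, kernel_top, sizeof(kernel_top));  /* boot-time copy/sync */

            /* Later, fill a lower-level entry dynamically through the kernel view. */
            kernel_top[2][5] = 0xabc;

            /* The copy shares the prepopulated lower table, so it sees the new entry. */
            printf("other context sees: %#lx\n", other_top[2] ? other_top[2][5] : 0);

            free(kernel_top[2]);
            return 0;
        }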
    • x86/mm/pat: Fix off-by-one bugs in interval tree search · 91298f1a
      Authored by Ingo Molnar
      There's a bug in the new PAT code; the conversion of memtype_check_conflict()
      is buggy:
      
         8d04a5f9: ("x86/mm/pat: Convert the PAT tree to a generic interval tree")
      
              dprintk("Overlap at 0x%Lx-0x%Lx\n", match->start, match->end);
              found_type = match->type;
      
      -       node = rb_next(&match->rb);
      -       while (node) {
      -               match = rb_entry(node, struct memtype, rb);
      -
      -               if (match->start >= end) /* Checked all possible matches */
      -                       goto success;
      -
      -               if (is_node_overlap(match, start, end) &&
      -                   match->type != found_type) {
      +       match = memtype_interval_iter_next(match, start, end);
      +       while (match) {
      +               if (match->type != found_type)
                              goto failure;
      -               }
      
      -               node = rb_next(&match->rb);
      +               match = memtype_interval_iter_next(match, start, end);
              }
      
      Note how the '>= end' condition that ends the interval check got converted
      into:
      
      +       match = memtype_interval_iter_next(match, start, end);
      
      This is subtly off by one, because the interval trees interfaces require
      closed interval parameters:
      
        include/linux/interval_tree_generic.h
      
       /*                                                                            \
        * Iterate over intervals intersecting [start;last]                           \
        *                                                                            \
        * Note that a node's interval intersects [start;last] iff:                   \
        *   Cond1: ITSTART(node) <= last                                             \
        * and                                                                        \
        *   Cond2: start <= ITLAST(node)                                             \
        */                                                                           \
      
        ...
      
                      if (ITSTART(node) <= last) {            /* Cond1 */           \
                              if (start <= ITLAST(node))      /* Cond2 */           \
                                      return node;    /* node is leftmost match */  \
      
      [start;last] is a closed interval (note that '<= last' check) - while the
      PAT 'end' parameter is 1 byte beyond the end of the range, because
      ioremap() and the other mapping APIs usually use the [start,end)
      half-open interval, derived from 'size'.
      
      This is what ioremap() does for example:
      
              /*
               * Mappings have to be page-aligned
               */
              offset = phys_addr & ~PAGE_MASK;
              phys_addr &= PHYSICAL_PAGE_MASK;
              size = PAGE_ALIGN(last_addr+1) - phys_addr;
      
              retval = reserve_memtype(phys_addr, (u64)phys_addr + size,
                                                      pcm, &new_pcm);
      
      phys_addr+size will be on a page boundary, after the last byte of the
      mapped interval.
      
      So the correct parameter to use in the interval tree searches is not
      'end' but 'end-1'.
      
      This could have relevance if conflicting PAT ranges are exactly adjacent,
      for example a future WC region is followed immediately by an already
      mapped UC- region - in this case memtype_check_conflict() would
      incorrectly deny the WC memtype region and downgrade the memtype to UC-.
      
      BTW., rather annoyingly this downgrading is done silently in
      memtype_check_insert():
      
      int memtype_check_insert(struct memtype *new,
                               enum page_cache_mode *ret_type)
      {
              int err = 0;
      
              err = memtype_check_conflict(new->start, new->end, new->type, ret_type);
              if (err)
                      return err;
      
              if (ret_type)
                      new->type = *ret_type;
      
              memtype_interval_insert(new, &memtype_rbroot);
              return 0;
      }
      
      So on such a conflict we'd just silently get UC- in *ret_type, and write
      it into the new region, never the wiser ...
      
      So assuming that the patch below fixes the primary bug the diagnostics
      side of ioremap() cache attribute downgrades would be another thing to
      fix.
      
      Anyway, I checked all the interval-tree iterations, and most of them are
      off by one - but I think the one related to memtype_check_conflict() is
      the one causing this particular performance regression.
      
      The only correct interval-tree searches were these two:
      
        arch/x86/mm/pat_interval.c:     match = memtype_interval_iter_first(&memtype_rbroot, 0, ULONG_MAX);
        arch/x86/mm/pat_interval.c:             match = memtype_interval_iter_next(match, 0, ULONG_MAX);
      
      The ULONG_MAX was hiding the off-by-one in plain sight. :-)
      
      Note that the bug was probably benign in the sense of implementing a too
      strict cache attribute conflict policy and downgrading cache attributes,
      so AFAICS the worst outcome of this bug would be a performance regression,
      not any instabilities.
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Reported-by: Kenneth R. Crudup <kenny@panix.com>
      Reported-by: Mariusz Ceier <mceier+kernel@gmail.com>
      Tested-by: Mariusz Ceier <mceier@gmail.com>
      Tested-by: Kenneth R. Crudup <kenny@panix.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20191201144947.GA4167@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      91298f1a
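
      [ Editor's illustration, plain C with no interval tree: the closed vs
        half-open mismatch in one overlap test. With closed-interval semantics,
        a query for [start, end) must pass end - 1, otherwise a range that
        merely touches at 'end' is reported as overlapping. ]

        #include <stdbool.h>
        #include <stdio.h>

        /* Closed-interval overlap test, as the interval tree iterators use: [a0,a1] vs [b0,b1]. */
        static bool closed_overlap(unsigned long a0, unsigned long a1,
                                   unsigned long b0, unsigned long b1)
        {
            return a0 <= b1 && b0 <= a1;
        }

        int main(void)
        {
            /* Existing UC- mapping covering bytes [0x1000, 0x2000). */
            unsigned long uc_start = 0x1000, uc_end = 0x2000;
            /* New WC request for the adjacent range [0x2000, 0x3000). */
            unsigned long wc_start = 0x2000, wc_end = 0x3000;

            /* Buggy query: passes the half-open 'end' straight through. */
            bool bug = closed_overlap(uc_start, uc_end, wc_start, wc_end);
            /* Fixed query: the closed interval for [start, end) is [start, end - 1]. */
            bool fix = closed_overlap(uc_start, uc_end - 1, wc_start, wc_end - 1);

            printf("adjacent ranges conflict? buggy=%d fixed=%d\n", bug, fix);
            return 0;
        }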
  11. 01 Dec 2019 (1 commit)
  12. 29 Nov 2019 (3 commits)
  13. 28 Nov 2019 (1 commit)
    • x86/fpu: Don't cache access to fpu_fpregs_owner_ctx · 59c4bd85
      Authored by Sebastian Andrzej Siewior
      The state/owner of the FPU is saved to fpu_fpregs_owner_ctx by pointing
      to the context that is currently loaded. It never changed during the
      lifetime of a task - it remained stable/constant.
      
      After deferred FPU registers loading until return to userland was
      implemented, the content of fpu_fpregs_owner_ctx may change during
      preemption and must not be cached.
      
      This went unnoticed for some time and was now noticed, in particular
      since gcc 9 is caching that load in copy_fpstate_to_sigframe() and
      reusing it in the retry loop:
      
        copy_fpstate_to_sigframe()
          load fpu_fpregs_owner_ctx and save on stack
          fpregs_lock()
          copy_fpregs_to_sigframe() /* failed */
          fpregs_unlock()
               *** PREEMPTION, another uses FPU, changes fpu_fpregs_owner_ctx ***
      
          fault_in_pages_writeable() /* succeed, retry */
      
          fpregs_lock()
      	__fpregs_load_activate()
      	  fpregs_state_valid() /* uses fpu_fpregs_owner_ctx from stack */
          copy_fpregs_to_sigframe() /* succeeds, random FPU content */
      
      This is a comparison of the assembly produced by gcc 9, without vs with this
      patch:
      
      | # arch/x86/kernel/fpu/signal.c:173:      if (!access_ok(buf, size))
      |        cmpq    %rdx, %rax      # tmp183, _4
      |        jb      .L190   #,
      |-# arch/x86/include/asm/fpu/internal.h:512:       return fpu == this_cpu_read_stable(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu;
      |-#APP
      |-# 512 "arch/x86/include/asm/fpu/internal.h" 1
      |-       movq %gs:fpu_fpregs_owner_ctx,%rax      #, pfo_ret__
      |-# 0 "" 2
      |-#NO_APP
      |-       movq    %rax, -88(%rbp) # pfo_ret__, %sfp
      …
      |-# arch/x86/include/asm/fpu/internal.h:512:       return fpu == this_cpu_read_stable(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu;
      |-       movq    -88(%rbp), %rcx # %sfp, pfo_ret__
      |-       cmpq    %rcx, -64(%rbp) # pfo_ret__, %sfp
      |+# arch/x86/include/asm/fpu/internal.h:512:       return fpu == this_cpu_read(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu;
      |+#APP
      |+# 512 "arch/x86/include/asm/fpu/internal.h" 1
      |+       movq %gs:fpu_fpregs_owner_ctx(%rip),%rax        # fpu_fpregs_owner_ctx, pfo_ret__
      |+# 0 "" 2
      |+# arch/x86/include/asm/fpu/internal.h:512:       return fpu == this_cpu_read(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu;
      |+#NO_APP
      |+       cmpq    %rax, -64(%rbp) # pfo_ret__, %sfp
      
      Use this_cpu_read() instead this_cpu_read_stable() to avoid caching of
      fpu_fpregs_owner_ctx during preemption points.
      
      The Fixes: tag points to the commit where deferred FPU loading was
      added. Since this commit, the compiler is no longer allowed to move the
      load of fpu_fpregs_owner_ctx somewhere else / outside of the locked
      section. A task preemption will change its value and stale content will
      be observed.
      
       [ bp: Massage. ]
      Debugged-by: Austin Clements <austin@google.com>
      Debugged-by: David Chase <drchase@golang.org>
      Debugged-by: Ian Lance Taylor <ian@airs.com>
      Fixes: 5f409e20 ("x86/fpu: Defer FPU state load until return to userspace")
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Tested-by: Borislav Petkov <bp@suse.de>
      Cc: Aubrey Li <aubrey.li@intel.com>
      Cc: Austin Clements <austin@google.com>
      Cc: Barret Rhoden <brho@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Chase <drchase@golang.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: ian@airs.com
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Josh Bleecher Snyder <josharian@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20191128085306.hxfa2o3knqtu4wfn@linutronix.de
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=205663
      59c4bd85
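
      [ Editor's sketch, userspace analogy only: owner_ctx, preemption_point()
        and state_valid() are invented. Re-reading the global at the check plays
        the role the non-"stable" this_cpu_read() plays in the kernel: a value
        that preemption can change must not be copied to the stack once before
        the retry loop. ]

        #include <stdio.h>

        /* Stand-in for the per-CPU owner pointer that preemption can rewrite. */
        static const void *owner_ctx = (const void *)0x1;

        /* Model of a preemption point: another task loads its FPU state. */
        static void preemption_point(void)
        {
            owner_ctx = (const void *)0x2;
        }

        static int state_valid(const void *fpu, const void *observed_owner)
        {
            return fpu == observed_owner;
        }

        int main(void)
        {
            const void *my_fpu = (const void *)0x1;

            /* Buggy pattern: the "stable" read is hoisted and cached on the stack. */
            const void *cached = owner_ctx;
            preemption_point();
            printf("cached check: %s\n",
                   state_valid(my_fpu, cached) ? "valid (wrong!)" : "invalid");

            /* Fixed pattern: re-read the owner after the point where it may change. */
            printf("fresh check:  %s\n",
                   state_valid(my_fpu, owner_ctx) ? "valid (wrong!)" : "invalid");
            return 0;
        }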
  14. 27 Nov 2019 (10 commits)