1. 16 September 2018, 1 commit
    • x86/mm: Add .bss..decrypted section to hold shared variables · b3f0907c
      Authored by Brijesh Singh
      kvmclock defines a few static variables which are shared with the
      hypervisor during the kvmclock initialization.
      
      When SEV is active, memory is encrypted with a guest-specific key, and
      if the guest OS wants to share the memory region with the hypervisor
      then it must clear the C-bit before sharing it.
      
      Currently, we use kernel_physical_mapping_init() to split large pages
      before clearing the C-bit on shared pages. But it fails when called from
      the kvmclock initialization (mainly because the memblock allocator is
      not ready that early during boot).
      
      Add a __bss_decrypted section attribute which can be used when defining
      such shared variables. The so-defined variables will be placed in the
      .bss..decrypted section. This section will be mapped with C=0 early
      during boot.
      
      The .bss..decrypted section reserves a big chunk of memory that is unused
      when memory encryption is not active, so free it in that case.
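      
      As a rough illustration (not part of the original patch text), such an
      attribute and a shared variable could look like this; the variable name is
      made up:
      
        #define __bss_decrypted __attribute__((__section__(".bss..decrypted")))
        
        /* kvmclock-style variable shared with the hypervisor, mapped with C=0 */
        static unsigned long shared_with_hypervisor __bss_decrypted __attribute__((aligned(4096)));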
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: kvm@vger.kernel.org
      Link: https://lkml.kernel.org/r/1536932759-12905-2-git-send-email-brijesh.singh@amd.com
  2. 20 July 2018, 1 commit
    • x86/mm/pti: Make pti_clone_kernel_text() compile on 32 bit · 39d668e0
      Authored by Joerg Roedel
      The pti_clone_kernel_text() function references __end_rodata_hpage_align,
      which is only present on x86-64.  This makes sense as the end of the rodata
      section is not huge-page aligned on 32 bit.
      
      Nevertheless, the function needs a symbol that points at the right
      address on both 32 and 64 bit. Introduce __end_rodata_aligned for that
      purpose and use it in pti_clone_kernel_text().
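      
      A hedged sketch of how the new symbol might be consumed from C (the helper
      names below are hypothetical; the real caller is pti_clone_kernel_text()):
      
        extern char _text[], __end_rodata_aligned[];
        extern void clone_range_into_user_pagetables(unsigned long start, unsigned long end); /* hypothetical */
        
        static void clone_kernel_text_sketch(void)
        {
                unsigned long start = (unsigned long)_text;
                unsigned long end   = (unsigned long)__end_rodata_aligned; /* valid on 32 and 64 bit */
        
                clone_range_into_user_pagetables(start, end);
        }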
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Pavel Machek <pavel@ucw.cz>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Waiman Long <llong@redhat.com>
      Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
      Cc: joro@8bytes.org
      Link: https://lkml.kernel.org/r/1531906876-13451-28-git-send-email-joro@8bytes.org
  3. 13 May 2018, 1 commit
  4. 09 March 2018, 1 commit
    • x86/kprobes: Fix kernel crash when probing .entry_trampoline code · c07a8f8b
      Authored by Francis Deslauriers
      Disable the kprobe probing of the entry trampoline:
      
      .entry_trampoline is a code area that is used to ensure page table
      isolation between userspace and kernelspace.
      
      At the beginning of the execution of the trampoline, we load the
      kernel's CR3 register. This has the effect of enabling the translation
      of kernel virtual addresses to physical addresses. Before this
      happens, most kernel addresses cannot be translated because the running
      process' CR3 is still in use.
      
      If a kprobe is placed on the trampoline code before that change of the
      CR3 register happens, the kernel crashes because the int3 handling pages
      are not accessible.
      
      To fix this, add the .entry_trampoline section to the kprobe blacklist
      to prohibit the probing of code before all the kernel pages are
      accessible.
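      
      A minimal sketch of such a blacklist check, assuming section boundary
      symbols named __entry_trampoline_start/_end (in the real kernel this would
      be one range among several checked by arch_within_kprobe_blacklist()):
      
        extern char __entry_trampoline_start[], __entry_trampoline_end[];
        
        static bool within_entry_trampoline(unsigned long addr)
        {
                /* refuse probes placed before the kernel CR3 is loaded */
                return addr >= (unsigned long)__entry_trampoline_start &&
                       addr <  (unsigned long)__entry_trampoline_end;
        }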
      Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: mathieu.desnoyers@efficios.com
      Cc: mhiramat@kernel.org
      Link: http://lkml.kernel.org/r/1520565492-4637-2-git-send-email-francis.deslauriers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 13 February 2018, 1 commit
  6. 19 January 2018, 1 commit
  7. 24 December 2017, 1 commit
    • x86/entry: Align entry text section to PMD boundary · 2f7412ba
      Authored by Thomas Gleixner
      The (irq)entry text must be visible in the user space page tables. To allow
      simple PMD based sharing, make the entry text PMD aligned.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 17 December 2017, 1 commit
    • x86/entry/64: Create a per-CPU SYSCALL entry trampoline · 3386bc8a
      Authored by Andy Lutomirski
      Handling SYSCALL is tricky: the SYSCALL handler is entered with every
      single register (except FLAGS), including RSP, live.  It somehow needs
      to set RSP to point to a valid stack, which means it needs to save the
      user RSP somewhere and find its own stack pointer.  The canonical way
      to do this is with SWAPGS, which lets us access percpu data using the
      %gs prefix.
      
      With PAGE_TABLE_ISOLATION-like pagetable switching, this is
      problematic.  Without a scratch register, switching CR3 is impossible, so
      %gs-based percpu memory would need to be mapped in the user pagetables.
      Doing that without information leaks is difficult or impossible.
      
      Instead, use a different sneaky trick.  Map a copy of the first part
      of the SYSCALL asm at a different address for each CPU.  Now RIP
      varies depending on the CPU, so we can use RIP-relative memory access
      to access percpu memory.  By putting the relevant information (one
      scratch slot and the stack address) at a constant offset relative to
      RIP, we can make SYSCALL work without relying on %gs.
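      
      A hedged C-level sketch of the idea (the real trampoline is assembly; the
      struct layout and helper below are illustrative only):
      
        struct cpu_entry_stub {
                char trampoline_text[256];      /* per-CPU copy of the SYSCALL entry stub  */
                unsigned long scratch;          /* reachable RIP-relative from the stub    */
                unsigned long top_of_stack;     /* kernel stack pointer, also RIP-relative */
        };
        
        static void point_syscall_at_percpu_copy(struct cpu_entry_stub *stub)
        {
                /* MSR_LSTAR holds the SYSCALL entry point; aim it at this CPU's copy */
                wrmsrl(MSR_LSTAR, (unsigned long)stub->trampoline_text);
        }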
      
      A nice thing about this approach is that we can easily switch it on
      and off if we want pagetable switching to be configurable.
      
      The compat variant of SYSCALL doesn't have this problem in the first
      place -- there are plenty of scratch registers, since we don't care
      about preserving r8-r15.  This patch therefore doesn't touch SYSCALL32
      at all.
      
      This patch actually seems to be a small speedup.  With this patch,
      SYSCALL touches an extra cache line and an extra virtual page, but
      the pipeline no longer stalls waiting for SWAPGS.  It seems that, at
      least in a tight loop, the latter outweighs the former.
      
      Thanks to David Laight for an optimization tip.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Borislav Petkov <bpetkov@suse.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Link: https://lkml.kernel.org/r/20171204150606.403607157@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  9. 02 November 2017, 1 commit
    • License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Authored by Greg Kroah-Hartman
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it,
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side-by-side results from the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      The criteria used to select files for SPDX license identifier tagging were:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In the initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
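      
      For illustration, the two comment styles such a script has to emit (this
      follows the kernel's documented SPDX convention; it is not quoted from
      this patch):
      
        /* SPDX-License-Identifier: GPL-2.0 */   /* first line of a .h header file */
        
        // SPDX-License-Identifier: GPL-2.0      // first line of a .c source file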
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  10. 26 July 2017, 1 commit
    • x86/unwind: Add the ORC unwinder · ee9f8fce
      Authored by Josh Poimboeuf
      Add the new ORC unwinder which is enabled by CONFIG_ORC_UNWINDER=y.
      It plugs into the existing x86 unwinder framework.
      
      It relies on objtool to generate the needed .orc_unwind and
      .orc_unwind_ip sections.
      
      For more details on why ORC is used instead of DWARF, see
      Documentation/x86/orc-unwinder.txt - but the short version is
      that it's a simplified, fundamentally more robust debuginfo
      data structure, which also allows up to two orders of magnitude
      faster lookups than the DWARF unwinder - which matters to
      profiling workloads like perf.
      
      Thanks to Andy Lutomirski for the performance improvement ideas:
      splitting the ORC unwind table into two parallel arrays and creating a
      fast lookup table to search a subset of the unwind table.
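      
      A hedged sketch of the two-parallel-array layout and the lookup it enables
      (types and names are illustrative, not the kernel's; the fast lookup table
      mentioned above would narrow the search window first and is omitted here):
      
        struct orc_entry { int sp_offset; int bp_offset; unsigned int type; };
        
        extern int orc_unwind_ip[];              /* sorted instruction-pointer keys  */
        extern struct orc_entry orc_unwind[];    /* unwind data, same index as above */
        extern unsigned int orc_num_entries;
        
        static struct orc_entry *orc_find(unsigned long ip)
        {
                unsigned int lo = 0, hi = orc_num_entries;
        
                while (lo < hi) {                /* binary search over the IP array */
                        unsigned int mid = (lo + hi) / 2;
                        if ((unsigned long)orc_unwind_ip[mid] <= ip)
                                lo = mid + 1;
                        else
                                hi = mid;
                }
                return lo ? &orc_unwind[lo - 1] : NULL; /* entry covering ip, if any */
        }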
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: live-patching@vger.kernel.org
      Link: http://lkml.kernel.org/r/0a6cbfb40f8da99b7a45a1a8302dc6aef16ec812.1500938583.git.jpoimboe@redhat.com
      [ Extended the changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 03 April 2017, 1 commit
    • debug: Fix __bug_table[] in arch linker scripts · b5effd38
      Authored by Peter Zijlstra
      The kbuild test robot reported this build failure on a number
      of architectures:
      
       >         make.cross ARCH=arm
       >    lib/lib.a(bug.o): In function `find_bug':
       > >> lib/bug.c:135: undefined reference to `__start___bug_table'
       > >> lib/bug.c:135: undefined reference to `__stop___bug_table'
      
      Caused by:
      
        19d43626 ("debug: Add _ONCE() logic to report_bug()")
      
      Which moved the BUG_TABLE from RO_DATA_SECTION() to RW_DATA_SECTION(),
      but a number of architectures don't use RW_DATA_SECTION(), so they
      ended up with no __bug_table[] ...
      
      Ideally all those would use RW_DATA_SECTION() in their linker scripts,
      but that's for another day.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Cc: kbuild-all@01.org
      Cc: tipbuild@zytor.com
      Link: http://lkml.kernel.org/r/20170330154927.o6qmgfp4bdhrajbm@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  12. 02 March 2017, 1 commit
  13. 24 February 2017, 1 commit
    • objtool: Improve detection of BUG() and other dead ends · d1091c7f
      Authored by Josh Poimboeuf
      The BUG() macro's use of __builtin_unreachable() via the unreachable()
      macro tells gcc that the instruction is a dead end, and that it's safe
      to assume the current code path will not execute past the previous
      instruction.
      
      On x86, the BUG() macro is implemented with the 'ud2' instruction.  When
      objtool's branch analysis sees that instruction, it knows the current
      code path has come to a dead end.
      
      Peter Zijlstra has been working on a patch to change the WARN macros to
      use 'ud2'.  That patch will break objtool's assumption that 'ud2' is
      always a dead end.
      
      Generally it's best for objtool to avoid making those kinds of
      assumptions anyway.  The more ignorant it is of kernel code internals,
      the better.
      
      So create a more generic way for objtool to detect dead ends by adding
      an annotation to the unreachable() macro.  The annotation stores a
      pointer to the end of the unreachable code path in an '__unreachable'
      section.  Objtool can read that section to find the dead ends.
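      
      A hedged sketch of what such an annotation can look like (the directive
      details are approximate, not copied from the patch):
      
        /* record the location of the dead end in a special section for objtool */
        #define annotate_unreachable() ({                                       \
                asm volatile("1:\n\t"                                           \
                             ".pushsection __unreachable, \"a\"\n\t"            \
                             ".long 1b\n\t"                                     \
                             ".popsection\n\t");                                \
        })
        
        #define unreachable() \
                do { annotate_unreachable(); __builtin_unreachable(); } while (0)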
      Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/41a6d33971462ebd944a1c60ad4bf5be86c17b77.1487712920.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  14. 20 October 2016, 1 commit
    • x86/boot: Move the _stext marker to before the boot code · e728f61c
      Authored by Josh Poimboeuf
      When core_kernel_text() is used to determine whether an address on a
      task's stack trace is a kernel text address, it incorrectly returns
      false for early text addresses for the head code between the _text and
      _stext markers.  Among other things, this can cause the unwinder to
      behave incorrectly when unwinding to x86 head code.
      
      Head code is text code too, so mark it as such.  This seems to match the
      intent of other users of the _stext symbol, and it also seems consistent
      with what other architectures are already doing.
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nilay Vaish <nilayvaish@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/789cf978866420e72fa89df44aa2849426ac378d.1474480779.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  15. 08 October 2016, 1 commit
  16. 29 April 2016, 1 commit
    • x86/boot: Move compressed kernel to the end of the decompression buffer · 974f221c
      Authored by Yinghai Lu
      This change makes later calculations about where the kernel is located
      easier to reason about. To better understand this change, we must first
      clarify what 'VO' and 'ZO' are. These values were introduced in commits
      by hpa:
      
        77d1a499 ("x86, boot: make symbols from the main vmlinux available")
        37ba7ab5 ("x86, boot: make kernel_alignment adjustable; new bzImage fields")
      
      Specifically:
      
      All names prefixed with 'VO_':
      
       - relate to the uncompressed kernel image
      
       - the size of the VO image is: VO__end-VO__text ("VO_INIT_SIZE" define)
      
      All names prefixed with 'ZO_':
      
       - relate to the bootable compressed kernel image (boot/compressed/vmlinux),
         which is composed of the following memory areas:
           - head text
           - compressed kernel (VO image and relocs table)
           - decompressor code
      
       - the size of the ZO image is: ZO__end - ZO_startup_32 ("ZO_INIT_SIZE" define, though see below)
      
      The 'INIT_SIZE' value is used to find the larger of the two image sizes:
      
       #define ZO_INIT_SIZE    (ZO__end - ZO_startup_32 + ZO_z_extract_offset)
       #define VO_INIT_SIZE    (VO__end - VO__text)
      
       #if ZO_INIT_SIZE > VO_INIT_SIZE
       # define INIT_SIZE ZO_INIT_SIZE
       #else
       # define INIT_SIZE VO_INIT_SIZE
       #endif
      
      The current code uses extract_offset to decide where to position the
      copied ZO (i.e. ZO starts at extract_offset). (This is why ZO_INIT_SIZE
      currently includes the extract_offset.)
      
      Why does z_extract_offset exist? It's needed because we are trying to minimize
      the amount of RAM used for the whole act of creating an uncompressed, executable,
      properly relocation-linked kernel image in system memory. We do this so that
      kernels can be booted on even very small systems.
      
      To achieve the goal of minimal memory consumption we have implemented an in-place
      decompression strategy: instead of cleanly separating the VO and ZO images and
      also allocating some memory for the decompression code's runtime needs, we instead
      create this elaborate layout of memory buffers where the output (decompressed)
      stream, as it progresses, overlaps with and destroys the input (compressed)
      stream. This can only be done safely if the ZO image is placed to the end of the
      VO range, plus a certain amount of safety distance to make sure that when the last
      bytes of the VO range are decompressed, the compressed stream pointer is safely
      beyond the end of the VO range.
      
      z_extract_offset is calculated in arch/x86/boot/compressed/mkpiggy.c during
      the build process, at a point when we know the exact compressed and
      uncompressed size of the kernel images and can calculate this safe minimum
      offset value. (Note that the mkpiggy.c calculation is not perfect, because
      we don't know the decompressor used at that stage, so the z_extract_offset
      calculation is necessarily imprecise and is mostly based on gzip internals -
      we'll improve that in the next patch.)
      
      When INIT_SIZE is bigger than VO_INIT_SIZE (uncommon but possible),
      the copied ZO occupies the memory from extract_offset to the end of
      the decompression buffer. It overlaps with the soon-to-be-uncompressed kernel
      like this:
      
                                  |-----compressed kernel image------|
                                  V                                  V
      0                       extract_offset                      +INIT_SIZE
      |-----------|---------------|-------------------------|--------|
                  |               |                         |        |
                VO__text      startup_32 of ZO          VO__end    ZO__end
                  ^                                         ^
                  |-------uncompressed kernel image---------|
      
      When INIT_SIZE is equal to VO_INIT_SIZE (likely), there is still space
      left from the end of ZO to the end of the decompression buffer, like below.
      
                                  |-compressed kernel image-|
                                  V                         V
      0                       extract_offset                      +INIT_SIZE
      |-----------|---------------|-------------------------|--------|
                  |               |                         |        |
                VO__text      startup_32 of ZO          ZO__end    VO__end
                  ^                                                  ^
                  |------------uncompressed kernel image-------------|
      
      To simplify calculations and avoid special cases, it is cleaner to
      always place the compressed kernel image in memory so that ZO__end
      is at the end of the decompression buffer, instead of placing it at
      extract_offset as is currently done.
      
      This patch adds BP_init_size (which is the INIT_SIZE as passed in from
      the boot_params) into asm-offsets.c to make it visible to the assembly
      code.
      
      Then when moving the ZO, it calculates the starting position of
      the copied ZO (via BP_init_size and the ZO run size) so that the VO__end
      will be at the end of the decompression buffer. To make the position
      calculation safe, the end of ZO is page aligned (and a comment is added
      to the existing VO alignment for good measure).
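      
      The placement arithmetic, as a hedged C fragment (the real logic is
      assembly in arch/x86/boot/compressed/head_{32,64}.S; the names here are
      illustrative):
      
        unsigned long zo_size    = zo_end - zo_startup_32;        /* size of the ZO image */
        unsigned long zo_rounded = (zo_size + 4095) & ~4095UL;    /* page-align the end   */
        unsigned long copy_dest  = load_address + init_size - zo_rounded;
        /* copying ZO to copy_dest puts ZO__end at load_address + init_size,
           i.e. exactly at the end of the decompression buffer */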
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      [ Rewrote changelog and comments. ]
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: lasse.collin@tukaani.org
      Link: http://lkml.kernel.org/r/1461888548-32439-3-git-send-email-keescook@chromium.org
      [ Rewrote the changelog some more. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  17. 26 March 2016, 1 commit
  18. 21 March 2016, 1 commit
    • x86/kallsyms: fix GOLD link failure with new relative kallsyms table format · 142b9e6c
      Authored by Ard Biesheuvel
      Commit 2213e9a6 ("kallsyms: add support for relative offsets in
      kallsyms address table") changed the default kallsyms symbol table
      format to use relative references rather than absolute addresses.
      
      This reduces the size of the kallsyms symbol table by 50% on 64-bit
      architectures, and further reduces the size of the relocation tables
      used by relocatable kernels.  Since the memory footprint of the static
      kernel image is always much smaller than 4 GB, these relative references
      are assumed to be representable in 32 bits, even when the native word
      size is 64 bits.
      
      On 64-bit architectures, this obviously only works if the distance
      between each relative reference and the chosen anchor point is
      representable in 32 bits, and so the table generation code in
      scripts/kallsyms.c scans the table for the lowest value that is covered
      by the kernel text, and selects it as the anchor point.
      
      However, when using the GOLD linker rather than the default BFD linker
      to build the x86_64 kernel, the symbol phys_offset_64, which is the
      result of arithmetic defined in the linker script, is emitted as a 'T'
      rather than an 'A' type symbol, causing scripts/kallsyms.c to
      mistake it for a suitable anchor point, even though it is far away from
      the actual kernel image in the virtual address space.  This results in
      out-of-range warnings from scripts/kallsyms.c and a broken build.
      
      So let's align with the BFD linker, and emit the phys_offset_[32|64]
      symbols as absolute symbols explicitly.  Note that the out of range
      issue does not exist on 32-bit x86, but this patch changes both symbols
      for symmetry.
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 29 February 2016, 1 commit
  20. 22 February 2016, 1 commit
  21. 30 January 2016, 1 commit
  22. 29 November 2015, 1 commit
  23. 11 September 2015, 1 commit
    • kexec: split kexec_load syscall from kexec core code · 2965faa5
      Authored by Dave Young
      There are two kexec load syscalls, kexec_load and kexec_file_load.
      kexec_file_load has already been split out into kernel/kexec_file.c.  This
      patch splits the kexec_load syscall code out into kernel/kexec.c.
      
      Also add a new Kconfig option, KEXEC_CORE, so that kexec_load can be
      disabled and only kexec_file_load used, or vice versa.
      
      The original requirement came from Ted Ts'o: he wants the kexec kernel
      signature to be checked when CONFIG_KEXEC_VERIFY_SIG is enabled.  But
      kexec-tools can bypass that check by using the kexec_load syscall.
      
      Vivek Goyal proposed creating a common Kconfig option so users can compile
      in only one syscall for loading the kexec kernel.  KEXEC/KEXEC_FILE select
      KEXEC_CORE so that old config files still work.
      
      Because generic code needs CONFIG_KEXEC_CORE, update all the architecture
      Kconfig files with the new KEXEC_CORE option and have KEXEC select
      KEXEC_CORE in the arch Kconfig.  Also update the generic kernel code that
      refers to the kexec_load syscall.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Dave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  24. 05 November 2014, 1 commit
    • x86-64: Use RIP-relative addressing for most per-CPU accesses · 97b67ae5
      Authored by Jan Beulich
      Observing that per-CPU data (in the SMP case) is reachable by
      exploiting 64-bit address wraparound (building on the default kernel
      load address being at 16Mb), the one byte shorter RIP-relative
      addressing form can be used for most per-CPU accesses. The one
      exception is the "stable" reads, where the use of the "P" operand
      modifier prevents the compiler from using RIP-relative addressing, but
      is unavoidable due to the use of the "p" constraint (side note: with
      gcc 4.9.x the intended effect of this isn't being achieved anymore,
      see gcc bug 63637).
      
      With the dependency on the minimum kernel load address, arbitrarily
      low values for CONFIG_PHYSICAL_START are no longer possible. A
      link-time assertion is added, pointing to the need to increase
      that value when it triggers.
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Link: http://lkml.kernel.org/r/5458A1780200007800044A9D@mail.emea.novell.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  25. 19 March 2014, 2 commits
  26. 18 October 2013, 1 commit
  27. 11 March 2013, 1 commit
  28. 09 May 2012, 1 commit
  29. 11 August 2011, 1 commit
  30. 05 August 2011, 2 commits
  31. 15 July 2011, 1 commit
  32. 06 June 2011, 3 commits
  33. 24 May 2011, 1 commit
    • x86-64: Clean up vdso/kernel shared variables · 8c49d9a7
      Authored by Andy Lutomirski
      Variables that are shared between the vdso and the kernel are
      currently a bit of a mess.  They are each defined with their own
      magic, they are accessed differently in the kernel, the vsyscall page,
      and the vdso, and one of them (vsyscall_clock) doesn't even really
      exist.
      
      This changes them all to use a common mechanism.  All of them are
      declared in vvar.h with a fixed address (validated by the linker
      script).  In the kernel (as before), they look like ordinary
      read-write variables.  In the vsyscall page and the vdso, they are
      accessed through a new macro VVAR, which gives read-only access.
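      
      A hedged sketch of the kind of macro this describes (the names and the
      fixed-address guard below are made up, not the actual vvar.h contents):
      
        #ifdef BUILDING_VDSO   /* vDSO/vsyscall side: read-only, fixed address */
        # define VVAR(name)  (*(const __typeof__(__vvar_ ## name) *) \
                              (VVAR_FIXED_ADDRESS + __vvar_offset_ ## name))
        #else                  /* kernel side: an ordinary read-write variable */
        # define VVAR(name)  (__vvar_ ## name)
        #endif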
      
      The vdso is now loaded verbatim into memory without any fixups.  As a
      side bonus, access from the vdso is faster because a level of
      indirection is removed.
      
      While we're at it, pack jiffies and vgetcpu_mode into the same
      cacheline.
      Signed-off-by: Andy Lutomirski <luto@mit.edu>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Borislav Petkov <bp@amd64.org>
      Link: http://lkml.kernel.org/r/%3C7357882fbb51fa30491636a7b6528747301b7ee9.1306156808.git.luto%40mit.edu%3E
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  34. 22 May 2011, 1 commit
  35. 25 March 2011, 1 commit
    • percpu: Always align percpu output section to PAGE_SIZE · 0415b00d
      Authored by Tejun Heo
      The percpu allocator honors alignment requests up to PAGE_SIZE, and both the
      percpu addresses in the percpu address space and the translated kernel
      addresses should be aligned accordingly.  The calculation of the
      former depends on the alignment of percpu output section in the kernel
      image.
      
      The linker script macros PERCPU_VADDR() and PERCPU() are used to
      define this output section and the latter takes @align parameter.
      Several architectures are using @align smaller than PAGE_SIZE breaking
      percpu memory alignment.
      
      This patch removes @align parameter from PERCPU(), renames it to
      PERCPU_SECTION() and makes it always align to PAGE_SIZE.  While at it,
      add PCPU_SETUP_BUG_ON() checks so that alignment problems are
      reliably detected, and remove the percpu alignment comment recently added
      in workqueue.c, as the condition would trigger a BUG way before reaching
      there.
      
      For um, this patch raises the alignment of percpu area.  As the area
      is in .init, there shouldn't be any noticeable difference.
      
      This problem was discovered by David Howells while debugging boot
      failure on mn10300.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Mike Frysinger <vapier@gentoo.org>
      Cc: uclinux-dist-devel@blackfin.uclinux.org
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: user-mode-linux-devel@lists.sourceforge.net
  36. 09 March 2011, 1 commit
    • x86: Separate out entry text section · ea714547
      Authored by Jiri Olsa
      Put x86 entry code into a separate link section: .entry.text.
      
      Separating the entry text section seems to have performance
      benefits - caused by more efficient instruction cache usage.
      
      Running hackbench with perf stat --repeat showed that the change
      compresses the icache footprint. The icache load miss rate went
      down by about 15%:
      
       before patch:
               19417627  L1-icache-load-misses      ( +-   0.147% )
      
       after patch:
               16490788  L1-icache-load-misses      ( +-   0.180% )
      
      The motivation of the patch was to fix a particular kprobes
      bug that relates to the entry text section, the performance
      advantage was discovered accidentally.
      
      Whole perf output follows:
      
       - results for current tip tree:
      
        Performance counter stats for './hackbench/hackbench 10' (500 runs):
      
               19417627  L1-icache-load-misses      ( +-   0.147% )
             2676914223  instructions             #      0.497 IPC     ( +- 0.079% )
             5389516026  cycles                     ( +-   0.144% )
      
            0.206267711  seconds time elapsed   ( +-   0.138% )
      
       - results for current tip tree with the patch applied:
      
        Performance counter stats for './hackbench/hackbench 10' (500 runs):
      
               16490788  L1-icache-load-misses      ( +-   0.180% )
             2717734941  instructions             #      0.502 IPC     ( +- 0.079% )
             5414756975  cycles                     ( +-   0.148% )
      
            0.206747566  seconds time elapsed   ( +-   0.137% )
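      
      As a general illustration of placing code into a dedicated link section
      (the actual entry code is assembly placed via .pushsection/.popsection;
      the C form below is only an analogy):
      
        static void __attribute__((section(".entry.text"), used)) entry_stub_example(void)
        {
                /* instructions of this function land in the .entry.text section */
        }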
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: masami.hiramatsu.pt@hitachi.com
      Cc: ananth@in.ibm.com
      Cc: davem@davemloft.net
      Cc: 2nddept-manager@sdl.hitachi.co.jp
      LKML-Reference: <20110307181039.GB15197@jolsa.redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>