1. 04 11月, 2019 1 次提交
    • P
      kvm: mmu: ITLB_MULTIHIT mitigation · b8e8c830
      Paolo Bonzini 提交于
      With some Intel processors, putting the same virtual address in the TLB
      as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit
      and cause the processor to issue a machine check resulting in a CPU lockup.
      
      Unfortunately when EPT page tables use huge pages, it is possible for a
      malicious guest to cause this situation.
      
      Add a knob to mark huge pages as non-executable. When the nx_huge_pages
      parameter is enabled (and we are using EPT), all huge pages are marked as
      NX. If the guest attempts to execute in one of those pages, the page is
      broken down into 4K pages, which are then marked executable.
      
      This is not an issue for shadow paging (except nested EPT), because then
      the host is in control of TLB flushes and the problematic situation cannot
      happen.  With nested EPT, again the nested guest can cause problems shadow
      and direct EPT is treated in the same way.
      
      [ tglx: Fixup default to auto and massage wording a bit ]
      Originally-by: NJunaid Shahid <junaids@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      b8e8c830
  2. 28 10月, 2019 3 次提交
  3. 08 10月, 2019 1 次提交
    • B
      x86/xen: Return from panic notifier · c6875f3a
      Boris Ostrovsky 提交于
      Currently execution of panic() continues until Xen's panic notifier
      (xen_panic_event()) is called at which point we make a hypercall that
      never returns.
      
      This means that any notifier that is supposed to be called later as
      well as significant part of panic() code (such as pstore writes from
      kmsg_dump()) is never executed.
      
      There is no reason for xen_panic_event() to be this last point in
      execution since panic()'s emergency_restart() will call into
      xen_emergency_restart() from where we can perform our hypercall.
      
      Nevertheless, we will provide xen_legacy_crash boot option that will
      preserve original behavior during crash. This option could be used,
      for example, if running kernel dumper (which happens after panic
      notifiers) is undesirable.
      Reported-by: NJames Dingwall <james@dingwall.me.uk>
      Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: NJuergen Gross <jgross@suse.com>
      c6875f3a
  4. 25 9月, 2019 1 次提交
    • V
      mm, page_owner, debug_pagealloc: save and dump freeing stack trace · 8974558f
      Vlastimil Babka 提交于
      The debug_pagealloc functionality is useful to catch buggy page allocator
      users that cause e.g.  use after free or double free.  When page
      inconsistency is detected, debugging is often simpler by knowing the call
      stack of process that last allocated and freed the page.  When page_owner
      is also enabled, we record the allocation stack trace, but not freeing.
      
      This patch therefore adds recording of freeing process stack trace to page
      owner info, if both page_owner and debug_pagealloc are configured and
      enabled.  With only page_owner enabled, this info is not useful for the
      memory leak debugging use case.  dump_page() is adjusted to print the
      info.  An example result of calling __free_pages() twice may look like
      this (note the page last free stack trace):
      
      BUG: Bad page state in process bash  pfn:13d8f8
      page:ffffc31984f63e00 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x0
      flags: 0x1affff800000000()
      raw: 01affff800000000 dead000000000100 dead000000000122 0000000000000000
      raw: 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000
      page dumped because: nonzero _refcount
      page_owner tracks the page as freed
      page last allocated via order 0, migratetype Unmovable, gfp_mask 0xcc0(GFP_KERNEL)
       prep_new_page+0x143/0x150
       get_page_from_freelist+0x289/0x380
       __alloc_pages_nodemask+0x13c/0x2d0
       khugepaged+0x6e/0xc10
       kthread+0xf9/0x130
       ret_from_fork+0x3a/0x50
      page last free stack trace:
       free_pcp_prepare+0x134/0x1e0
       free_unref_page+0x18/0x90
       khugepaged+0x7b/0xc10
       kthread+0xf9/0x130
       ret_from_fork+0x3a/0x50
      Modules linked in:
      CPU: 3 PID: 271 Comm: bash Not tainted 5.3.0-rc4-2.g07a1a73-default+ #57
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
      Call Trace:
       dump_stack+0x85/0xc0
       bad_page.cold+0xba/0xbf
       rmqueue_pcplist.isra.0+0x6c5/0x6d0
       rmqueue+0x2d/0x810
       get_page_from_freelist+0x191/0x380
       __alloc_pages_nodemask+0x13c/0x2d0
       __get_free_pages+0xd/0x30
       __pud_alloc+0x2c/0x110
       copy_page_range+0x4f9/0x630
       dup_mmap+0x362/0x480
       dup_mm+0x68/0x110
       copy_process+0x19e1/0x1b40
       _do_fork+0x73/0x310
       __x64_sys_clone+0x75/0x80
       do_syscall_64+0x6e/0x1e0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x7f10af854a10
      ...
      
      Link: http://lkml.kernel.org/r/20190820131828.22684-5-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8974558f
  5. 14 9月, 2019 1 次提交
  6. 11 9月, 2019 1 次提交
  7. 05 9月, 2019 1 次提交
    • N
      powerpc/64s/radix: introduce options to disable use of the tlbie instruction · 2275d7b5
      Nicholas Piggin 提交于
      Introduce two options to control the use of the tlbie instruction. A
      boot time option which completely disables the kernel using the
      instruction, this is currently incompatible with HASH MMU, KVM, and
      coherent accelerators.
      
      And a debugfs option can be switched at runtime and avoids using tlbie
      for invalidating CPU TLBs for normal process and kernel address
      mappings. Coherent accelerators are still managed with tlbie, as will
      KVM partition scope translations.
      
      Cross-CPU TLB flushing is implemented with IPIs and tlbiel. This is a
      basic implementation which does not attempt to make any optimisation
      beyond the tlbie implementation.
      
      This is useful for performance testing among other things. For example
      in certain situations on large systems, using IPIs may be faster than
      tlbie as they can be directed rather than broadcast. Later we may also
      take advantage of the IPIs to do more interesting things such as trim
      the mm cpumask more aggressively.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190902152931.17840-7-npiggin@gmail.com
      2275d7b5
  8. 04 9月, 2019 1 次提交
  9. 03 9月, 2019 1 次提交
  10. 30 8月, 2019 1 次提交
  11. 28 8月, 2019 1 次提交
  12. 23 8月, 2019 1 次提交
  13. 22 8月, 2019 1 次提交
  14. 20 8月, 2019 2 次提交
    • M
      security: Add a static lockdown policy LSM · 000d388e
      Matthew Garrett 提交于
      While existing LSMs can be extended to handle lockdown policy,
      distributions generally want to be able to apply a straightforward
      static policy. This patch adds a simple LSM that can be configured to
      reject either integrity or all lockdown queries, and can be configured
      at runtime (through securityfs), boot time (via a kernel parameter) or
      build time (via a kconfig option). Based on initial code by David
      Howells.
      Signed-off-by: NMatthew Garrett <mjg59@google.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      000d388e
    • T
      x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16h · c49a0a80
      Tom Lendacky 提交于
      There have been reports of RDRAND issues after resuming from suspend on
      some AMD family 15h and family 16h systems. This issue stems from a BIOS
      not performing the proper steps during resume to ensure RDRAND continues
      to function properly.
      
      RDRAND support is indicated by CPUID Fn00000001_ECX[30]. This bit can be
      reset by clearing MSR C001_1004[62]. Any software that checks for RDRAND
      support using CPUID, including the kernel, will believe that RDRAND is
      not supported.
      
      Update the CPU initialization to clear the RDRAND CPUID bit for any family
      15h and 16h processor that supports RDRAND. If it is known that the family
      15h or family 16h system does not have an RDRAND resume issue or that the
      system will not be placed in suspend, the "rdrand=force" kernel parameter
      can be used to stop the clearing of the RDRAND CPUID bit.
      
      Additionally, update the suspend and resume path to save and restore the
      MSR C001_1004 value to ensure that the RDRAND CPUID setting remains in
      place after resuming from suspend.
      
      Note, that clearing the RDRAND CPUID bit does not prevent a processor
      that normally supports the RDRAND instruction from executing it. So any
      code that determined the support based on family and model won't #UD.
      Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Chen Yu <yu.c.chen@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>
      Cc: "linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: <stable@vger.kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "x86@kernel.org" <x86@kernel.org>
      Link: https://lkml.kernel.org/r/7543af91666f491547bd86cebb1e17c66824ab9f.1566229943.git.thomas.lendacky@amd.com
      c49a0a80
  15. 17 8月, 2019 1 次提交
  16. 14 8月, 2019 1 次提交
  17. 09 8月, 2019 1 次提交
  18. 02 8月, 2019 1 次提交
  19. 01 8月, 2019 1 次提交
    • S
      of/platform: Add functional dependency link from DT bindings · 690ff788
      Saravana Kannan 提交于
      Add device-links after the devices are created (but before they are
      probed) by looking at common DT bindings like clocks and
      interconnects.
      
      Automatically adding device-links for functional dependencies at the
      framework level provides the following benefits:
      
      - Optimizes device probe order and avoids the useless work of
        attempting probes of devices that will not probe successfully
        (because their suppliers aren't present or haven't probed yet).
      
        For example, in a commonly available mobile SoC, registering just
        one consumer device's driver at an initcall level earlier than the
        supplier device's driver causes 11 failed probe attempts before the
        consumer device probes successfully. This was with a kernel with all
        the drivers statically compiled in. This problem gets a lot worse if
        all the drivers are loaded as modules without direct symbol
        dependencies.
      
      - Supplier devices like clock providers, interconnect providers, etc
        need to keep the resources they provide active and at a particular
        state(s) during boot up even if their current set of consumers don't
        request the resource to be active. This is because the rest of the
        consumers might not have probed yet and turning off the resource
        before all the consumers have probed could lead to a hang or
        undesired user experience.
      
        Some frameworks (Eg: regulator) handle this today by turning off
        "unused" resources at late_initcall_sync and hoping all the devices
        have probed by then. This is not a valid assumption for systems with
        loadable modules. Other frameworks (Eg: clock) just don't handle
        this due to the lack of a clear signal for when they can turn off
        resources. This leads to downstream hacks to handle cases like this
        that can easily be solved in the upstream kernel.
      
        By linking devices before they are probed, we give suppliers a clear
        count of the number of dependent consumers. Once all of the
        consumers are active, the suppliers can turn off the unused
        resources without making assumptions about the number of consumers.
      
      By default we just add device-links to track "driver presence" (probe
      succeeded) of the supplier device. If any other functionality provided
      by device-links are needed, it is left to the consumer/supplier
      devices to change the link when they probe.
      
      kbuild test robot reported clang error about missing const
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Signed-off-by: NSaravana Kannan <saravanak@google.com>
      Link: https://lore.kernel.org/r/20190731221721.187713-4-saravanak@google.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      690ff788
  20. 24 7月, 2019 1 次提交
  21. 17 7月, 2019 4 次提交
  22. 15 7月, 2019 11 次提交
  23. 13 7月, 2019 2 次提交
    • A
      mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options · 6471384a
      Alexander Potapenko 提交于
      Patch series "add init_on_alloc/init_on_free boot options", v10.
      
      Provide init_on_alloc and init_on_free boot options.
      
      These are aimed at preventing possible information leaks and making the
      control-flow bugs that depend on uninitialized values more deterministic.
      
      Enabling either of the options guarantees that the memory returned by the
      page allocator and SL[AU]B is initialized with zeroes.  SLOB allocator
      isn't supported at the moment, as its emulation of kmem caches complicates
      handling of SLAB_TYPESAFE_BY_RCU caches correctly.
      
      Enabling init_on_free also guarantees that pages and heap objects are
      initialized right after they're freed, so it won't be possible to access
      stale data by using a dangling pointer.
      
      As suggested by Michal Hocko, right now we don't let the heap users to
      disable initialization for certain allocations.  There's not enough
      evidence that doing so can speed up real-life cases, and introducing ways
      to opt-out may result in things going out of control.
      
      This patch (of 2):
      
      The new options are needed to prevent possible information leaks and make
      control-flow bugs that depend on uninitialized values more deterministic.
      
      This is expected to be on-by-default on Android and Chrome OS.  And it
      gives the opportunity for anyone else to use it under distros too via the
      boot args.  (The init_on_free feature is regularly requested by folks
      where memory forensics is included in their threat models.)
      
      init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
      objects with zeroes.  Initialization is done at allocation time at the
      places where checks for __GFP_ZERO are performed.
      
      init_on_free=1 makes the kernel initialize freed pages and heap objects
      with zeroes upon their deletion.  This helps to ensure sensitive data
      doesn't leak via use-after-free accesses.
      
      Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
      returns zeroed memory.  The two exceptions are slab caches with
      constructors and SLAB_TYPESAFE_BY_RCU flag.  Those are never
      zero-initialized to preserve their semantics.
      
      Both init_on_alloc and init_on_free default to zero, but those defaults
      can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
      CONFIG_INIT_ON_FREE_DEFAULT_ON.
      
      If either SLUB poisoning or page poisoning is enabled, those options take
      precedence over init_on_alloc and init_on_free: initialization is only
      applied to unpoisoned allocations.
      
      Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:
      
      hackbench, init_on_free=1:  +7.62% sys time (st.err 0.74%)
      hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
      
      Linux build with -j12, init_on_free=1:  +8.38% wall time (st.err 0.39%)
      Linux build with -j12, init_on_free=1:  +24.42% sys time (st.err 0.52%)
      Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
      Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
      
      The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
      is within the standard error.
      
      The new features are also going to pave the way for hardware memory
      tagging (e.g.  arm64's MTE), which will require both on_alloc and on_free
      hooks to set the tags for heap objects.  With MTE, tagging will have the
      same cost as memory initialization.
      
      Although init_on_free is rather costly, there are paranoid use-cases where
      in-memory data lifetime is desired to be minimized.  There are various
      arguments for/against the realism of the associated threat models, but
      given that we'll need the infrastructure for MTE anyway, and there are
      people who want wipe-on-free behavior no matter what the performance cost,
      it seems reasonable to include it in this series.
      
      [glider@google.com: v8]
        Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
      [glider@google.com: v9]
        Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
      [glider@google.com: v10]
        Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
      Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.comSigned-off-by: NAlexander Potapenko <glider@google.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>		[page and dmapool parts
      Acked-by: James Morris <jamorris@linux.microsoft.com>]
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Sandeep Patil <sspatil@android.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6471384a
    • V
      mm, debug_pagealloc: use a page type instead of page_ext flag · 3972f6bb
      Vlastimil Babka 提交于
      When debug_pagealloc is enabled, we currently allocate the page_ext
      array to mark guard pages with the PAGE_EXT_DEBUG_GUARD flag.  Now that
      we have the page_type field in struct page, we can use that instead, as
      guard pages are neither PageSlab nor mapped to userspace.  This reduces
      memory overhead when debug_pagealloc is enabled and there are no other
      features requiring the page_ext array.
      
      Link: http://lkml.kernel.org/r/20190603143451.27353-4-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3972f6bb