1. 13 5月, 2017 1 次提交
  2. 11 5月, 2017 1 次提交
  3. 09 5月, 2017 4 次提交
  4. 08 5月, 2017 2 次提交
    • X
      x86/kexec/64: Use gbpages for identity mappings if available · 8638100c
      Xunlei Pang 提交于
      Kexec sets up all identity mappings before booting into the new
      kernel, and this will cause extra memory consumption for paging
      structures which is quite considerable on modern machines with
      huge memory sizes.
      
      E.g. on a 32TB machine that is kdumping, it could waste around
      128MB (around 4MB/TB) from the reserved memory after kexec sets
      all the identity mappings using the current 2MB page.
      
      Add to that the memory needed for the loaded kdump kernel, initramfs,
      etc., and it causes a kexec syscall -NOMEM failure.
      
      As a result, we had to enlarge reserved memory via "crashkernel=X"
      to work around this problem.
      
      This causes some trouble for distributions that use policies
      to evaluate the proper "crashkernel=X" value for users.
      
      So enable gbpages for kexec mappings.
      Signed-off-by: NXunlei Pang <xlpang@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: akpm@linux-foundation.org
      Cc: kexec@lists.infradead.org
      Link: http://lkml.kernel.org/r/1493862171-8799-2-git-send-email-xlpang@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8638100c
    • X
      x86/mm: Add support for gbpages to kernel_ident_mapping_init() · 66aad4fd
      Xunlei Pang 提交于
      Kernel identity mappings on x86-64 kernels are created in two
      ways: by the early x86 boot code, or by kernel_ident_mapping_init().
      
      Native kernels (which is the dominant usecase) use the former,
      but the kexec and the hibernation code uses kernel_ident_mapping_init().
      
      There's a subtle difference between these two ways of how identity
      mappings are created, the current kernel_ident_mapping_init() code
      creates identity mappings always using 2MB page(PMD level) - while
      the native kernel boot path also utilizes gbpages where available.
      
      This difference is suboptimal both for performance and for memory
      usage: kernel_ident_mapping_init() needs to allocate pages for the
      page tables when creating the new identity mappings.
      
      This patch adds 1GB page(PUD level) support to kernel_ident_mapping_init()
      to address these concerns.
      
      The primary advantage would be better TLB coverage/performance,
      because we'd utilize 1GB TLBs instead of 2MB ones.
      
      It is also useful for machines with large number of memory to
      save paging structure allocations(around 4MB/TB using 2MB page)
      when setting identity mappings for all the memory, after using
      1GB page it will consume only 8KB/TB.
      
      ( Note that this change alone does not activate gbpages in kexec,
        we are doing that in a separate patch. )
      Signed-off-by: NXunlei Pang <xlpang@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: akpm@linux-foundation.org
      Cc: kexec@lists.infradead.org
      Link: http://lkml.kernel.org/r/1493862171-8799-1-git-send-email-xlpang@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      66aad4fd
  5. 02 5月, 2017 4 次提交
  6. 27 4月, 2017 1 次提交
    • S
      x86, iommu/vt-d: Add an option to disable Intel IOMMU force on · bfd20f1c
      Shaohua Li 提交于
      IOMMU harms performance signficantly when we run very fast networking
      workloads. It's 40GB networking doing XDP test. Software overhead is
      almost unaware, but it's the IOTLB miss (based on our analysis) which
      kills the performance. We observed the same performance issue even with
      software passthrough (identity mapping), only the hardware passthrough
      survives. The pps with iommu (with software passthrough) is only about
      ~30% of that without it. This is a limitation in hardware based on our
      observation, so we'd like to disable the IOMMU force on, but we do want
      to use TBOOT and we can sacrifice the DMA security bought by IOMMU. I
      must admit I know nothing about TBOOT, but TBOOT guys (cc-ed) think not
      eabling IOMMU is totally ok.
      
      So introduce a new boot option to disable the force on. It's kind of
      silly we need to run into intel_iommu_init even without force on, but we
      need to disable TBOOT PMR registers. For system without the boot option,
      nothing is changed.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      bfd20f1c
  7. 26 4月, 2017 3 次提交
    • A
      x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly() · 9ccee237
      Andy Lutomirski 提交于
      mark_screen_rdonly() is the last remaining caller of flush_tlb().
      flush_tlb_mm_range() is potentially faster and isn't obsolete.
      
      Compile-tested only because I don't know whether software that uses
      this mechanism even exists.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/791a644076fc3577ba7f7b7cafd643cc089baa7d.1492844372.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9ccee237
    • J
      x86/unwind: Dump all stacks in unwind_dump() · 262fa734
      Josh Poimboeuf 提交于
      Currently unwind_dump() dumps only the most recently accessed stack.
      But it has a few issues.
      
      In some cases, 'first_sp' can get out of sync with 'stack_info', causing
      unwind_dump() to start from the wrong address, flood the printk buffer,
      and eventually read a bad address.
      
      In other cases, dumping only the most recently accessed stack doesn't
      give enough data to diagnose the error.
      
      Fix both issues by dumping *all* stacks involved in the trace, not just
      the last one.
      Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 8b5e99f0 ("x86/unwind: Dump stack data on warnings")
      Link: http://lkml.kernel.org/r/016d6a9810d7d1bfc87ef8c0e6ee041c6744c909.1493171120.git.jpoimboe@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      262fa734
    • J
      x86/unwind: Silence more entry-code related warnings · b0d50c7b
      Josh Poimboeuf 提交于
      Borislav Petkov reported the following unwinder warning:
      
        WARNING: kernel stack regs at ffffc9000024fea8 in udevadm:92 has bad 'bp' value 00007fffc4614d30
        unwind stack type:0 next_sp:          (null) mask:0x6 graph_idx:0
        ffffc9000024fea8: 000055a6100e9b38 (0x55a6100e9b38)
        ffffc9000024feb0: 000055a6100e9b35 (0x55a6100e9b35)
        ffffc9000024feb8: 000055a6100e9f68 (0x55a6100e9f68)
        ffffc9000024fec0: 000055a6100e9f50 (0x55a6100e9f50)
        ffffc9000024fec8: 00007fffc4614d30 (0x7fffc4614d30)
        ffffc9000024fed0: 000055a6100eaf50 (0x55a6100eaf50)
        ffffc9000024fed8: 0000000000000000 ...
        ffffc9000024fee0: 0000000000000100 (0x100)
        ffffc9000024fee8: ffff8801187df488 (0xffff8801187df488)
        ffffc9000024fef0: 00007ffffffff000 (0x7ffffffff000)
        ffffc9000024fef8: 0000000000000000 ...
        ffffc9000024ff10: ffffc9000024fe98 (0xffffc9000024fe98)
        ffffc9000024ff18: 00007fffc4614d00 (0x7fffc4614d00)
        ffffc9000024ff20: ffffffffffffff10 (0xffffffffffffff10)
        ffffc9000024ff28: ffffffff811c6c1f (SyS_newlstat+0xf/0x10)
        ffffc9000024ff30: 0000000000000010 (0x10)
        ffffc9000024ff38: 0000000000000296 (0x296)
        ffffc9000024ff40: ffffc9000024ff50 (0xffffc9000024ff50)
        ffffc9000024ff48: 0000000000000018 (0x18)
        ffffc9000024ff50: ffffffff816b2e6a (entry_SYSCALL_64_fastpath+0x18/0xa8)
        ...
      
      It unwinded from an interrupt which came in right after entry code
      called into a C syscall handler, before it had a chance to set up the
      frame pointer, so regs->bp still had its user space value.
      
      Add a check to silence warnings in such a case, where an interrupt
      has occurred and regs->sp is almost at the end of the stack.
      Reported-by: NBorislav Petkov <bp@suse.de>
      Tested-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: c32c47c6 ("x86/unwind: Warn on bad frame pointer")
      Link: http://lkml.kernel.org/r/c695f0d0d4c2cfe6542b90e2d0520e11eb901eb5.1493171120.git.jpoimboe@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b0d50c7b
  8. 21 4月, 2017 1 次提交
    • S
      x86/ftrace: Fix ebp in ftrace_regs_caller that screws up unwinder · dc912c30
      Steven Rostedt (VMware) 提交于
      Fengguang Wu's zero day bot triggered a stack unwinder dump. This can
      be easily triggered when CONFIG_FRAME_POINTERS is enabled and -mfentry
      is in use on x86_32.
      
       ># cd /sys/kernel/debug/tracing
       ># echo 'p:schedule schedule' > kprobe_events
       ># echo stacktrace > events/kprobes/schedule/trigger
      
      This is because the code that implemented fentry in the ftrace_regs_caller
      tried to use the least amount of #ifdefs, and modified ebp when
      CC_USE_FENTRY was defined to point to the parent ip as it does when
      CC_USE_FENTRY is not defined. But when CONFIG_FRAME_POINTERS is set, it
      corrupts the ebp register for this frame while doing the tracing.
      
      NOTE, it does not corrupt ebp in any other way. It is just a bad frame
      pointer when calling into the tracing infrastructure. The original ebp is
      restored before returning from the fentry call. But if a stack trace is
      performed inside the tracing, the unwinder will notice the bad ebp.
      
      Instead of toying with ebp with CC_USING_FENTRY, just slap the parent ip
      into the second parameter (%edx), and have an #else that does it the
      original way.
      
      The unwinder will unfortunately miss the function being traced, as the
      stack frame is not set up yet for it, as it is for x86_64. But fixing that
      is a bit more complex and did not work before anyway.
      
      This has been tested with and without FRAME_POINTERS being set while using
      -mfentry, as well as using an older compiler that uses mcount.
      Analyzed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Fixes: 644e0e8d ("x86/ftrace: Add -mfentry support to x86_32 with DYNAMIC_FTRACE set")
      Reported-by: Nkernel test robot <fengguang.wu@intel.com>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Link: https://lists.01.org/pipermail/lkp/2017-April/006165.html
      Link: http://lkml.kernel.org/r/20170420172236.7af7f6e5@gandalf.local.homeSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      dc912c30
  9. 20 4月, 2017 6 次提交
  10. 19 4月, 2017 6 次提交
  11. 18 4月, 2017 2 次提交
  12. 15 4月, 2017 2 次提交
    • D
      x86/irq: Remove a redundant #ifdef directive · 0ccecd95
      Dou Liyang 提交于
      The call to irq_ctx_init() is wrapped in #ifdef CONFIG_X86_32.
      
      The declaration of irq_ctx_init in irq.h provides already a stub inline for
      the X86_32=n case.
      
      Remove the redundant #ifdef in the code.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: NDou Liyang <douly.fnst@cn.fujitsu.com>
      Link: http://lkml.kernel.org/r/1491811500-30307-1-git-send-email-douly.fnst@cn.fujitsu.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      0ccecd95
    • N
      x86/apic/timer: Set ->min_delta_ticks and ->max_delta_ticks · 747d04b3
      Nicolai Stange 提交于
      In preparation for making the clockevents core NTP correction aware,
      all clockevent device drivers must set ->min_delta_ticks and
      ->max_delta_ticks rather than ->min_delta_ns and ->max_delta_ns: a
      clockevent device's rate is going to change dynamically and thus, the
      ratio of ns to ticks ceases to stay invariant.
      
      Make the x86 arch's apic clockevent driver initialize these fields
      properly.
      
      This patch alone doesn't introduce any change in functionality as the
      clockevents core still looks exclusively at the (untouched) ->min_delta_ns
      and ->max_delta_ns. As soon as this has changed, a followup patch will
      purge the initialization of ->min_delta_ns and ->max_delta_ns from this
      driver.
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Stephen Boyd <sboyd@codeaurora.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      CC: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: NNicolai Stange <nicstange@gmail.com>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      747d04b3
  13. 14 4月, 2017 7 次提交