1. 18 7月, 2015 1 次提交
    • D
      x86/fpu, sched: Dynamically allocate 'struct fpu' · 0c8c0f03
      Dave Hansen 提交于
      The FPU rewrite removed the dynamic allocations of 'struct fpu'.
      But, this potentially wastes massive amounts of memory (2k per
      task on systems that do not have AVX-512 for instance).
      
      Instead of having a separate slab, this patch just appends the
      space that we need to the 'task_struct' which we dynamically
      allocate already.  This saves from doing an extra slab
      allocation at fork().
      
      The only real downside here is that we have to stick everything
      and the end of the task_struct.  But, I think the
      BUILD_BUG_ON()s I stuck in there should keep that from being too
      fragile.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1437128892-9831-2-git-send-email-mingo@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0c8c0f03
  2. 17 7月, 2015 2 次提交
  3. 07 7月, 2015 3 次提交
    • T
      x86/irq: Retrieve irq data after locking irq_desc · 09cf92b7
      Thomas Gleixner 提交于
      irq_data is protected by irq_desc->lock, so retrieving the irq chip
      from irq_data outside the lock is racy vs. an concurrent update. Move
      it into the lock held region.
      
      While at it add a comment why the vector walk does not require
      vector_lock.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: xiao jin <jin.xiao@intel.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com>
      Link: http://lkml.kernel.org/r/20150705171102.331320612@linutronix.de
      09cf92b7
    • T
      x86/irq: Use proper locking in check_irq_vectors_for_cpu_disable() · cbb24dc7
      Thomas Gleixner 提交于
      It's unsafe to examine fields in the irq descriptor w/o holding the
      descriptor lock. Add proper locking.
      
      While at it add a comment why the vector check can run lock less
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: xiao jin <jin.xiao@intel.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com>
      Link: http://lkml.kernel.org/r/20150705171102.236544164@linutronix.de
      cbb24dc7
    • T
      x86/irq: Plug irq vector hotplug race · 5a3f75e3
      Thomas Gleixner 提交于
      Jin debugged a nasty cpu hotplug race which results in leaking a irq
      vector on the newly hotplugged cpu.
      
      cpu N				cpu M
      native_cpu_up                   device_shutdown
        do_boot_cpu			  free_msi_irqs
        start_secondary                   arch_teardown_msi_irqs
          smp_callin                        default_teardown_msi_irqs
             setup_vector_irq                  arch_teardown_msi_irq
              __setup_vector_irq		   native_teardown_msi_irq
                lock(vector_lock)		     destroy_irq 
                install vectors
                unlock(vector_lock)
      					       lock(vector_lock)
      --->                                  	       __clear_irq_vector
                                          	       unlock(vector_lock)
          lock(vector_lock)
          set_cpu_online
          unlock(vector_lock)
      
      This leaves the irq vector(s) which are torn down on CPU M stale in
      the vector array of CPU N, because CPU M does not see CPU N online
      yet. There is a similar issue with concurrent newly setup interrupts.
      
      The alloc/free protection of irq descriptors does not prevent the
      above race, because it merily prevents interrupt descriptors from
      going away or changing concurrently.
      
      Prevent this by moving the call to setup_vector_irq() into the
      vector_lock held region which protects set_cpu_online():
      
      cpu N				cpu M
      native_cpu_up                   device_shutdown
        do_boot_cpu			  free_msi_irqs
        start_secondary                   arch_teardown_msi_irqs
          smp_callin                        default_teardown_msi_irqs
             lock(vector_lock)                arch_teardown_msi_irq
             setup_vector_irq()
              __setup_vector_irq		   native_teardown_msi_irq
                install vectors		     destroy_irq 
             set_cpu_online
             unlock(vector_lock)
      					       lock(vector_lock)
                                        	       __clear_irq_vector
                                          	       unlock(vector_lock)
      
      So cpu M either sees the cpu N online before clearing the vector or
      cpu N installs the vectors after cpu M has cleared it.
      Reported-by: Nxiao jin <jin.xiao@intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com>
      Link: http://lkml.kernel.org/r/20150705171102.141898931@linutronix.de
      5a3f75e3
  4. 06 7月, 2015 6 次提交
    • S
      x86/earlyprintk: Allow early_printk() to use console style parameters like '115200n8' · 827a82ff
      Steven Rostedt 提交于
      When I enable early_printk on a kernel, I cut and paste the
      console= input and add to earlyprintk parameter. But I notice
      recently that ktest has not been detecting triple faults. The
      way it detects it, is by seeing the kernel banner "Linux version
      .." with a different kernel version pop up. Then I noticed that
      early printk was no longer working on my console, which was why
      ktest was not seeing it.
      
      I bisected it down and it was added to 4.0 with this commit:
      
        ea9e9d80 ("Specify PCI based UART for earlyprintk")
      
      because it converted the simple_strtoul() that converts the baud
      number into a kstrtoul(). The problem with this is, I had as my
      baud rate, 115200n8 (acceptable for console=ttyS0), but because
      of the "n8", the kstrtoul() doesn't parse the baud rate and
      returns an error, which sets the baud rate to the default 9600.
      This explains the garbage on my screen.
      
      Now, earlyprintk= kernel parameter does not say it accepts that
      format. Thus, one answer would simply be me changing my kernel
      parameters to remove the "n8" since it isn't parsed anyway. But
      I wonder if other people run into this, and it seems strange
      that the two consoles for serial accepts different input.
      
      I could also extend this to have earlyprintk do something with
      that "n8" or whatever it has and have it match the console
      parsing (which, BTW, still uses simple_strtoul(), as I guess it
      has to).
      
      This patch just makes my old kernel parameter parsing work like
      it use to.
      
      Although, simple_strtoull() is considered obsolete, it is the
      only standard string parsing function that parses a number that
      is attached to text. Ironically, commit ea9e9d80 also added
      several calls to simple_strtoul()!
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Cohen <david.a.cohen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stuart R. Anderson <stuart.r.anderson@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150706101434.5f6a351b@gandalf.local.home
      [ Cleaned it up a bit. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      827a82ff
    • Z
      x86/espfix: Init espfix on the boot CPU side · 20d5e4a9
      Zhu Guihua 提交于
      As we alloc pages with GFP_KERNEL in init_espfix_ap() which is
      called before we enable local irqs, so the lockdep sub-system
      would (correctly) trigger a warning about the potentially
      blocking API.
      
      So we allocate them on the boot CPU side when the secondary CPU is
      brought up by the boot CPU, and hand them over to the secondary
      CPU.
      
      And we use alloc_pages_node() with the secondary CPU's node, to
      make sure the espfix stack is NUMA-local to the CPU that is
      going to use it.
      Signed-off-by: NZhu Guihua <zhugh.fnst@cn.fujitsu.com>
      Cc: <bp@alien8.de>
      Cc: <luto@amacapital.net>
      Cc: <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/c97add2670e9abebb90095369f0cfc172373ac94.1435824469.git.zhugh.fnst@cn.fujitsu.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      20d5e4a9
    • Z
      x86/espfix: Add 'cpu' parameter to init_espfix_ap() · 1db87563
      Zhu Guihua 提交于
      Add a CPU index parameter to init_espfix_ap(), so that the
      parameter could be propagated to the function for espfix
      page allocation.
      Signed-off-by: NZhu Guihua <zhugh.fnst@cn.fujitsu.com>
      Cc: <bp@alien8.de>
      Cc: <luto@amacapital.net>
      Cc: <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/cde3fcf1b3211f3f03feb1a995bce3fee850f0fc.1435824469.git.zhugh.fnst@cn.fujitsu.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1db87563
    • A
      x86/kasan: Fix KASAN shadow region page tables · 5d5aa3cf
      Alexander Popov 提交于
      Currently KASAN shadow region page tables created without
      respect of physical offset (phys_base). This causes kernel halt
      when phys_base is not zero.
      
      So let's initialize KASAN shadow region page tables in
      kasan_early_init() using __pa_nodebug() which considers
      phys_base.
      
      This patch also separates x86_64_start_kernel() from KASAN low
      level details by moving kasan_map_early_shadow(init_level4_pgt)
      into kasan_early_init().
      
      Remove the comment before clear_bss() which stopped bringing
      much profit to the code readability. Otherwise describing all
      the new order dependencies would be too verbose.
      Signed-off-by: NAlexander Popov <alpopov@ptsecurity.com>
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: <stable@vger.kernel.org> # 4.0+
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1435828178-10975-3-git-send-email-a.ryabinin@samsung.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5d5aa3cf
    • A
      x86/init: Clear 'init_level4_pgt' earlier · d0f77d4d
      Andrey Ryabinin 提交于
      Currently x86_64_start_kernel() has two KASAN related
      function calls. The first call maps shadow to early_level4_pgt,
      the second maps shadow to init_level4_pgt.
      
      If we move clear_page(init_level4_pgt) earlier, we could hide
      KASAN low level detail from generic x86_64 initialization code.
      The next patch will do it.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: <stable@vger.kernel.org> # 4.0+
      Cc: Alexander Popov <alpopov@ptsecurity.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1435828178-10975-2-git-send-email-a.ryabinin@samsung.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d0f77d4d
    • A
      x86/tsc: Let high latency PIT fail fast in quick_pit_calibrate() · 5aac644a
      Adrian Hunter 提交于
      If it takes longer than 12us to read the PIT counter lsb/msb,
      then the error margin will never fall below 500ppm within 50ms,
      and Fast TSC calibration will always fail.
      
      This patch detects when that will happen and fails fast. Note
      the failure message is not printed in that case because:
      1. it will always happen on that class of hardware
      2. the absence of the message is more informative than its
      presence
      Signed-off-by: NAdrian Hunter <adrian.hunter@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Link: http://lkml.kernel.org/r/556EB717.9070607@intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      5aac644a
  5. 04 7月, 2015 1 次提交
    • I
      x86/fpu: Fix boot crash in the early FPU code · b96fecbf
      Ingo Molnar 提交于
      Jan Kara and Thomas Gleixner reported boot crashes in the FPU
      code:
      
        general protection fault: 0000 [#1] SMP
        RIP: 0010:[<ffffffff81048a6c>]  [<ffffffff81048a6c>] mxcsr_feature_mask_init+0x1c/0x40
      
        2b:*  0f ae 85 00 fe ff ff    fxsave -0x200(%rbp)
      
      and bisected it down to the following FPU commit:
      
         91a8c2a5 ("x86/fpu: Clean up and fix MXCSR handling")
      
      The reason is that the on-stack FPU registers state variable,
      used by the FXSAVE instruction, did not have the required
      minimum alignment of 16 bytes, causing the general protection
      fault.
      
      This is most likely a GCC bug in older GCC versions, but the
      offending commit also added a bogus extra 32-byte alignment
      (which GCC ignored too).
      
      So fix this bug by making the variable static again, but also
      mark it __initdata this time, because fpu__init_system_mxcsr()
      is now an __init function.
      Reported-and-bisected-by: NJan Kara <jack@suse.cz>
      Reported-bisected-and-tested-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150704075819.GA9201@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b96fecbf
  6. 01 7月, 2015 2 次提交
  7. 30 6月, 2015 2 次提交
    • P
      perf/x86: Fix 'active_events' imbalance · 93472aff
      Peter Zijlstra 提交于
      Commit 1b7b938f ("perf/x86/intel: Fix PMI handling for Intel PT") conditionally
      increments active_events in x86_add_exclusive() but unconditionally decrements in
      x86_del_exclusive().
      
      These extra decrements can lead to the situation where
      active_events is zero and thus the PMI handler is 'disabled'
      while we have active events on the PMU generating PMIs.
      
      This leads to a truckload of:
      
        Uhhuh. NMI received for unknown reason 21 on CPU 28.
        Do you have a strange power saving mode enabled?
        Dazed and confused, but trying to continue
      
      messages and generally messes up perf.
      
      Remove the condition on the increment, double increment balanced
      by a double decrement is perfectly fine.
      
      Restructure the code a little bit to make the unconditional inc
      a bit more natural.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: alexander.shishkin@linux.intel.com
      Cc: brgerst@gmail.com
      Cc: dvlasenk@redhat.com
      Cc: luto@amacapital.net
      Cc: oleg@redhat.com
      Fixes: 1b7b938f ("perf/x86/intel: Fix PMI handling for Intel PT")
      Link: http://lkml.kernel.org/r/20150624144750.GJ18673@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      93472aff
    • I
      x86/fpu: Fix FPU related boot regression when CPUID masking BIOS feature is enabled · db52ef74
      Ingo Molnar 提交于
      Mike Galbraith reported:
      
        " My i7-4790 box is having one hell of a time with this merge
          window, dead in the water.
      
          BIOS setting "Limit CPUID Maximum" upsets new fpu code
          mightily. "
      
      It turns out that Linux does a double workaround here, as per:
      
        066941bd ("x86: unmask CPUID levels on Intel CPUs")
      
      it undoes the BIOS workaround - but as a side effect the CPUID
      state is not completely constant during early init anymore,
      and the new FPU init code did not take this into account.
      
      So what happened is that the xstate init code did not have full
      CPUID available, which broke subsequent attempts to use xstate
      features.
      
      Fix this by ordering the early FPU init code to after we've
      stabilized the CPUID state.
      Reported-bisected-and-tested-by: NMike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150627082514.GA10894@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      db52ef74
  8. 26 6月, 2015 1 次提交
    • T
      libnvdimm: Set numa_node to NVDIMM devices · 41d7a6d6
      Toshi Kani 提交于
      ACPI NFIT table has System Physical Address Range Structure entries that
      describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
      set in the flags.
      
      Change acpi_nfit_register_region() to map a proximity ID to its node ID,
      and set it to a new numa_node field of nd_region_desc, which is then
      conveyed to the nd_region device.
      
      The device core arranges for btt and namespace devices to inherit their
      node from their parent region.
      Signed-off-by: NToshi Kani <toshi.kani@hp.com>
      [djbw: move set_dev_node() from region.c to bus.c]
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      41d7a6d6
  9. 25 6月, 2015 3 次提交
    • D
      libnvdimm, pmem: add libnvdimm support to the pmem driver · 9f53f9fa
      Dan Williams 提交于
      nd_pmem attaches to persistent memory regions and namespaces emitted by
      the libnvdimm subsystem, and, same as the original pmem driver, presents
      the system-physical-address range as a block device.
      
      The existing e820-type-12 to pmem setup is converted to an nvdimm_bus
      that emits an nd_namespace_io device.
      
      Note that the X in 'pmemX' is now derived from the parent region.  This
      provides some stability to the pmem devices names from boot-to-boot.
      The minor numbers are also more predictable by passing 0 to
      alloc_disk().
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: NChristoph Hellwig <hch@lst.de>
      Tested-by: NToshi Kani <toshi.kani@hp.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      9f53f9fa
    • T
      x86, mirror: x86 enabling - find mirrored memory ranges · b05b9f5f
      Tony Luck 提交于
      UEFI GetMemoryMap() uses a new attribute bit to mark mirrored memory
      address ranges.  See UEFI 2.5 spec pages 157-158:
      
        http://www.uefi.org/sites/default/files/resources/UEFI%202_5.pdf
      
      On EFI enabled systems scan the memory map and tell memblock about any
      mirrored ranges.
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xiexiuqi <xiexiuqi@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b05b9f5f
    • T
      mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute · fc6daaf9
      Tony Luck 提交于
      Some high end Intel Xeon systems report uncorrectable memory errors as a
      recoverable machine check.  Linux has included code for some time to
      process these and just signal the affected processes (or even recover
      completely if the error was in a read only page that can be replaced by
      reading from disk).
      
      But we have no recovery path for errors encountered during kernel code
      execution.  Except for some very specific cases were are unlikely to ever
      be able to recover.
      
      Enter memory mirroring. Actually 3rd generation of memory mirroing.
      
      Gen1: All memory is mirrored
      	Pro: No s/w enabling - h/w just gets good data from other side of the
      	     mirror
      	Con: Halves effective memory capacity available to OS/applications
      
      Gen2: Partial memory mirror - just mirror memory begind some memory controllers
      	Pro: Keep more of the capacity
      	Con: Nightmare to enable. Have to choose between allocating from
      	     mirrored memory for safety vs. NUMA local memory for performance
      
      Gen3: Address range partial memory mirror - some mirror on each memory
            controller
      	Pro: Can tune the amount of mirror and keep NUMA performance
      	Con: I have to write memory management code to implement
      
      The current plan is just to use mirrored memory for kernel allocations.
      This has been broken into two phases:
      
      1) This patch series - find the mirrored memory, use it for boot time
         allocations
      
      2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the
         unused mirrored memory from mm/memblock.c and only give it out to
         select kernel allocations (this is still being scoped because
         page_alloc.c is scary).
      
      This patch (of 3):
      
      Add extra "flags" to memblock to allow selection of memory based on
      attribute.  No functional changes
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xiexiuqi <xiexiuqi@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc6daaf9
  10. 22 6月, 2015 1 次提交
  11. 21 6月, 2015 1 次提交
  12. 20 6月, 2015 1 次提交
  13. 19 6月, 2015 5 次提交
    • B
      x86/boot: Fix overflow warning with 32-bit binutils · 04c17341
      Borislav Petkov 提交于
      When building the kernel with 32-bit binutils built with support
      only for the i386 target, we get the following warning:
      
        arch/x86/kernel/head_32.S:66: Warning: shift count out of range (32 is not between 0 and 31)
      
      The problem is that in that case, binutils' internal type
      representation is 32-bit wide and the shift range overflows.
      
      In order to fix this, manipulate the shift expression which
      creates the 4GiB constant to not overflow the shift count.
      Suggested-by: NMichael Matz <matz@suse.de>
      Reported-and-tested-by: NEnrico Mioso <mrkiko.rs@gmail.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      04c17341
    • P
      perf/x86: Honor the architectural performance monitoring version · 2c33645d
      Palik, Imre 提交于
      Architectural performance monitoring, version 1, doesn't support fixed counters.
      
      Currently, even if a hypervisor advertises support for architectural
      performance monitoring version 1, perf may still try to use the fixed
      counters, as the constraints are set up based on the CPU model.
      
      This patch ensures that perf honors the architectural performance monitoring
      version returned by CPUID, and it only uses the fixed counters for version 2
      and above.
      
      (Some of the ideas in this patch came from Peter Zijlstra.)
      Signed-off-by: NImre Palik <imrep@amazon.de>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Anthony Liguori <aliguori@amazon.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1433767609-1039-1-git-send-email-imrep.amz@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2c33645d
    • A
      perf/x86/intel: Fix PMI handling for Intel PT · 1b7b938f
      Alexander Shishkin 提交于
      Intel PT is a separate PMU and it is not using any of the x86_pmu
      code paths, which means in particular that the active_events counter
      remains intact when new PT events are created.
      
      However, PT uses the generic x86_pmu PMI handler for its PMI handling needs.
      
      The problem here is that the latter checks active_events and in case of it
      being zero, exits without calling the actual x86_pmu.handle_nmi(), which
      results in unknown NMI errors and massive data loss for PT.
      
      The effect is not visible if there are other perf events in the system
      at the same time that keep active_events counter non-zero, for instance
      if the NMI watchdog is running, so one needs to disable it to reproduce
      the problem.
      
      At the same time, the active_events counter besides doing what the name
      suggests also implicitly serves as a PMC hardware and DS area reference
      counter.
      
      This patch adds a separate reference counter for the PMC hardware, leaving
      active_events for actually counting the events and makes sure it also
      counts PT and BTS events.
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Link: http://lkml.kernel.org/r/87k2v92t0s.fsf@ashishki-desk.ger.corp.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1b7b938f
    • A
      perf/x86/intel/bts: Fix DS area sharing with x86_pmu events · 6b099d9b
      Alexander Shishkin 提交于
      Currently, the intel_bts driver relies on the DS area allocated by the x86_pmu
      code in its event_init() path, which is a bug: creating a BTS event while
      no x86_pmu events are present results in a NULL pointer dereference.
      
      The same DS area is also used by PEBS sampling, which makes it quite a bit
      trickier to have a separate one for intel_bts' purposes.
      
      This patch makes intel_bts driver use the same DS allocation and reference
      counting code as x86_pmu to make sure it is always present when either
      intel_bts or x86_pmu need it.
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Link: http://lkml.kernel.org/r/1434024837-9916-2-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6b099d9b
    • A
      perf/x86: Add more Broadwell model numbers · 4b36f1a4
      Andi Kleen 提交于
      This patch adds additional model numbers for Broadwell to perf.
      Support for Broadwell with Iris Pro (Intel Core i7-57xxC)
      and support for Broadwell Server Xeon.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1434055942-28253-1-git-send-email-andi@firstfloor.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      4b36f1a4
  14. 18 6月, 2015 2 次提交
  15. 17 6月, 2015 5 次提交
    • P
      x86: replace __init_or_module with __init in non-modular vsmp_64.c · 70c4f78b
      Paul Gortmaker 提交于
      The __init_or_module is from commit 05e12e1c
      ("x86: fix 27-rc crash on vsmp due to paravirt during module load").
      
      But as of commit 70511134
      ("Revert "x86: don't compile vsmp_64 for 32bit") this file became
      obj-y and hence is now only for built-in.  That makes any
      "_or_module" support no longer necessary.
      
      We need to distinguish between the two in order to do some header
      reorganization between init.h and module.h and we don't want to
      be including module.h in non-modular code.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      70c4f78b
    • P
      x86: perf_event_intel_pt.c: use arch_initcall to hook in enabling · 5b00c1eb
      Paul Gortmaker 提交于
      This was using module_init, but the current Kconfig situation is
      as follows:
      
      In arch/x86/kernel/cpu/Makefile:
      
        obj-$(CONFIG_CPU_SUP_INTEL)    += perf_event_intel_pt.o perf_event_intel_bts.o
      
      and in arch/x86/Kconfig.cpu:
      
        config CPU_SUP_INTEL
              default y
              bool "Support Intel processors" if PROCESSOR_SELECT
      
      So currently, the end user can not build this code into a module.
      If in the future, there is desire for this to be modular, then
      it can be changed to include <linux/module.h> and use module_init.
      
      But currently, in the non-modular case, a module_init becomes a
      device_initcall.  But this really isn't a device, so we should
      choose a more appropriate initcall bucket to put it in.
      
      The obvious choice here seems to be arch_initcall, but that does
      make it earlier than it was currently through device_initcall.
      As long as perf_pmu_register() is functional, we should be OK.
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      5b00c1eb
    • P
      x86: perf_event_intel_bts.c: use arch_initcall to hook in enabling · ca41d24c
      Paul Gortmaker 提交于
      This was using module_init, but the current Kconfig situation is
      as follows:
      
      In arch/x86/kernel/cpu/Makefile:
      
        obj-$(CONFIG_CPU_SUP_INTEL)    += perf_event_intel_pt.o perf_event_intel_bts.o
      
      and in arch/x86/Kconfig.cpu:
      
        config CPU_SUP_INTEL
              default y
              bool "Support Intel processors" if PROCESSOR_SELECT
      
      So currently, the end user can not build this code into a module.
      If in the future, there is desire for this to be modular, then
      it can be changed to include <linux/module.h> and use module_init.
      
      But currently, in the non-modular case, a module_init becomes a
      device_initcall.  But this really isn't a device, so we should
      choose a more appropriate initcall bucket to put it in.
      
      The obvious choice here seems to be arch_initcall, but that does
      make it earlier than it was currently through device_initcall.
      As long as perf_pmu_register() is functional, we should be OK.
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      ca41d24c
    • P
      x86: don't use module_init for non-modular core bootflag code · 1206f535
      Paul Gortmaker 提交于
      The bootflag.o is obj-y (always built in).  It will never be
      modular, so using module_init as an alias for __initcall is
      somewhat misleading.
      
      Fix this up now, so that we can relocate module_init from
      init.h into module.h in the future.  If we don't do this, we'd
      have to add module.h to obviously non-modular code, and that
      would be a worse thing.
      
      Note that direct use of __initcall is discouraged, vs. one
      of the priority categorized subgroups.  As __initcall gets
      mapped onto device_initcall, our use of arch_initcall (which
      makes sense for arch code) will thus change this registration
      from level 6-device to level 3-arch (i.e. slightly earlier).
      However no observable impact of that small difference has
      been observed during testing, or is expected.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      1206f535
    • P
      x86: don't use module_init in non-modular devicetree.c code · d54b675a
      Paul Gortmaker 提交于
      The devicetree.o is built for "OF" -- which is bool, and hence
      this code is either present or absent.  It will never be modular,
      so using module_init as an alias for __initcall can be somewhat
      misleading.
      
      Fix this up now, so that we can relocate module_init from
      init.h into module.h in the future.  If we don't do this, we'd
      have to add module.h to obviously non-modular code, and that
      would be a worse thing.
      
      Note that direct use of __initcall is discouraged, vs. one
      of the priority categorized subgroups.  As __initcall gets
      mapped onto device_initcall, our use of device_initcall
      directly in this change means that the runtime impact is
      zero -- it will remain at level 6 in initcall ordering.
      Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      d54b675a
  16. 12 6月, 2015 1 次提交
    • D
      x86/fpu: Fix double-increment in setup_xstate_features() · a8424003
      Dave Hansen 提交于
      I noticed that my MPX tracepoints were producing garbage for the
      lower and upper bounds:
      
      	mpx_bounds_register_exception: address referenced: 0x00007fffffffccb7 bounds: lower: 0x0 ~upper: 0xffffffffffffffff
      	mpx_bounds_register_exception: address referenced: 0x00007fffffffccbf bounds: lower: 0x0 ~upper: 0xffffffffffffffff
      
      This is, of course, bogus because 0x00007fffffffccbf is *within*
      the bounds.  I assumed that my instruction decoder was bad and
      went looking at it.  But I eventually realized that I was
      getting a '0' offset back from xstate_offsets[BNDREGS].
      
      It was being skipped in the initialization, which is obviously
      bogus, so remove the extra leaf++.
      
      This also goes an initializes xstate_offsets/sizes[] to -1 so
      so that bugs like this will oops instead of silently failing
      in interesting ways.
      
      This was introduced by:
      
      	39f1acd2 ("x86/fpu/xstate: Don't assume the first zero xfeatures zero bit means the end")
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@sr71.net
      Link: http://lkml.kernel.org/r/20150611193400.2E0B00DB@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a8424003
  17. 11 6月, 2015 2 次提交
    • J
      x86/crash: Allocate enough low memory when crashkernel=high · 94fb9334
      Joerg Roedel 提交于
      When the crash kernel is loaded above 4GiB in memory, the
      first kernel allocates only 72MiB of low-memory for the DMA
      requirements of the second kernel. On systems with many
      devices this is not enough and causes device driver
      initialization errors and failed crash dumps. Testing by
      SUSE and Redhat has shown that 256MiB is a good default
      value for now and the discussion has lead to this value as
      well. So set this default value to 256MiB to make sure there
      is enough memory available for DMA.
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      [ Reflow comment. ]
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Acked-by: NBaoquan He <bhe@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jörg Rödel <joro@8bytes.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: kexec@lists.infradead.org
      Link: http://lkml.kernel.org/r/1433500202-25531-4-git-send-email-joro@8bytes.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      94fb9334
    • J
      x86/swiotlb: Try coherent allocations with __GFP_NOWARN · 186dfc9d
      Joerg Roedel 提交于
      When we boot a kdump kernel in high memory, there is by
      default only 72MB of low memory available. The swiotlb code
      takes 64MB of it (by default) so that there are only 8MB
      left to allocate from. On systems with many devices this
      causes page allocator warnings from
      dma_generic_alloc_coherent():
      
        systemd-udevd: page allocation failure: order:0, mode:0x280d4
        CPU: 0 PID: 197 Comm: systemd-udevd Tainted: G        W
        3.12.28-4-default #1 Hardware name: HP ProLiant DL980 G7, BIOS
        P66 07/30/2012  ffff8800781335e0 ffffffff8150b1db 00000000000280d4 ffffffff8113af90
         0000000000000000 0000000000000000 ffff88007efdbb00 0000000100000000
         0000000000000000 0000000000000000 0000000000000000 0000000000000001
        Call Trace:
          dump_trace+0x7d/0x2d0
          show_stack_log_lvl+0x94/0x170
          show_stack+0x21/0x50
          dump_stack+0x41/0x51
          warn_alloc_failed+0xf0/0x160
          __alloc_pages_slowpath+0x72f/0x796
          __alloc_pages_nodemask+0x1ea/0x210
          dma_generic_alloc_coherent+0x96/0x140
          x86_swiotlb_alloc_coherent+0x1c/0x50
          ttm_dma_pool_alloc_new_pages+0xab/0x320 [ttm]
          ttm_dma_populate+0x3ce/0x640 [ttm]
          ttm_tt_bind+0x36/0x60 [ttm]
          ttm_bo_handle_move_mem+0x55f/0x5c0 [ttm]
          ttm_bo_move_buffer+0x105/0x130 [ttm]
          ttm_bo_validate+0xc1/0x130 [ttm]
          ttm_bo_init+0x24b/0x400 [ttm]
          radeon_bo_create+0x16c/0x200 [radeon]
          radeon_ring_init+0x11e/0x2b0 [radeon]
          r100_cp_init+0x123/0x5b0 [radeon]
          r100_startup+0x194/0x230 [radeon]
          r100_init+0x223/0x410 [radeon]
          radeon_device_init+0x6af/0x830 [radeon]
          radeon_driver_load_kms+0x89/0x180 [radeon]
          drm_get_pci_dev+0x121/0x2f0 [drm]
          local_pci_probe+0x39/0x60
          pci_device_probe+0xa9/0x120
          driver_probe_device+0x9d/0x3d0
          __driver_attach+0x8b/0x90
          bus_for_each_dev+0x5b/0x90
          bus_add_driver+0x1f8/0x2c0
          driver_register+0x5b/0xe0
          do_one_initcall+0xf2/0x1a0
          load_module+0x1207/0x1c70
          SYSC_finit_module+0x75/0xa0
          system_call_fastpath+0x16/0x1b
          0x7fac533d2788
      
      After these warnings the code enters a fall-back path and
      allocated directly from the swiotlb aperture in the end.
      So remove these warnings as this is not a fatal error.
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      [ Simplify, reflow comment. ]
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Acked-by: NBaoquan He <bhe@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jörg Rödel <joro@8bytes.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: kexec@lists.infradead.org
      Link: http://lkml.kernel.org/r/1433500202-25531-3-git-send-email-joro@8bytes.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      186dfc9d
  18. 09 6月, 2015 1 次提交