1. 30 Sep, 2016 9 commits
  2. 27 Sep, 2016 1 commit
  3. 26 Sep, 2016 2 commits
  4. 24 Sep, 2016 1 commit
  5. 22 Sep, 2016 6 commits
    • A
      perf/x86/intel/bts: Make it an exclusive PMU · 08b90f06
      Alexander Shishkin committed
      Just like intel_pt, intel_bts can only handle one event at a time,
      which is the reason we introduced PERF_PMU_CAP_EXCLUSIVE in the first
      place. However, at the moment one can have as many intel_bts events
      within the same context at the same time as one pleases. Only one of
      them, however, will get scheduled and receive the actual trace data.
      
      Fix this by making intel_bts an "exclusive" PMU.
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20160920154811.3255-2-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      08b90f06
    • D
      x86/boot: Fix kdump, cleanup aborted E820_PRAM max_pfn manipulation · 917db484
      Dan Williams committed
      In commit:
      
        ec776ef6 ("x86/mm: Add support for the non-standard protected e820 type")
      
      Christoph references the original patch I wrote implementing pmem support.
      The intent of the 'max_pfn' changes in that commit was to enable persistent
      memory ranges to be covered by the struct page memmap by default.
      
      However, that approach was abandoned when Christoph ported the patches [1], and
      that functionality has since been replaced by devm_memremap_pages().
      
      In the meantime, this max_pfn manipulation is confusing kdump [2] that
      assumes that everything covered by the max_pfn is "System RAM".  This
      results in kdump hanging or crashing.
      
       [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-March/000348.html
       [2]: https://bugzilla.redhat.com/show_bug.cgi?id=1351098
      
      So fix it.
      Reported-by: Zhang Yi <yizhan@redhat.com>
      Reported-by: Jeff Moyer <jmoyer@redhat.com>
      Tested-by: Zhang Yi <yizhan@redhat.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: <stable@vger.kernel.org> # v4.1 and later kernels
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-nvdimm@lists.01.org
      Fixes: ec776ef6 ("x86/mm: Add support for the non-standard protected e820 type")
      Link: http://lkml.kernel.org/r/147448744538.34910.11287693517367139607.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      917db484
    • G
      x86/acpi: Set persistent cpuid <-> nodeid mapping when booting · dc6db24d
      Gu Zheng committed
      The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
      when node online/offline happens, cache based on cpuid <-> nodeid mapping such as
      wq_numa_possible_cpumask will not cause any problem.
      It contains 4 steps:
      1. Enable apic registration flow to handle both enabled and disabled cpus.
      2. Introduce a new array storing all possible cpuid <-> apicid mapping.
      3. Enable _MAT and MADT relative apis to return non-present or disabled cpus' apicid.
      4. Establish all possible cpuid <-> nodeid mapping.
      
      This patch finishes step 4.
      
      This patch sets the persistent cpuid <-> nodeid mapping for all enabled/disabled
      processors at boot time via an additional acpi namespace walk for processors.
      
      [ tglx: Remove the unneeded exports ]
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
      Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: mika.j.penttila@gmail.com
      Cc: len.brown@intel.com
      Cc: rafael@kernel.org
      Cc: rjw@rjwysocki.net
      Cc: yasu.isimatu@gmail.com
      Cc: linux-mm@kvack.org
      Cc: linux-acpi@vger.kernel.org
      Cc: isimatu.yasuaki@jp.fujitsu.com
      Cc: gongzhaogang@inspur.com
      Cc: tj@kernel.org
      Cc: izumi.taku@jp.fujitsu.com
      Cc: cl@linux.com
      Cc: chen.tang@easystack.cn
      Cc: akpm@linux-foundation.org
      Cc: kamezawa.hiroyu@jp.fujitsu.com
      Cc: lenb@kernel.org
      Link: http://lkml.kernel.org/r/1472114120-3281-6-git-send-email-douly.fnst@cn.fujitsu.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      dc6db24d
    • G
      x86/acpi: Introduce persistent storage for cpuid <-> apicid mapping · 8f54969d
      Gu Zheng committed
      The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
      when node online/offline happens, cache based on cpuid <-> nodeid mapping such as
      wq_numa_possible_cpumask will not cause any problem.
      It contains 4 steps:
      1. Enable apic registration flow to handle both enabled and disabled cpus.
      2. Introduce a new array storing all possible cpuid <-> apicid mapping.
      3. Enable _MAT and MADT relative apis to return non-present or disabled cpus' apicid.
      4. Establish all possible cpuid <-> nodeid mapping.
      
      This patch finishes step 2.
      
      In this patch, we introduce a new static array named cpuid_to_apicid[],
      which is large enough to store info for all possible cpus.
      
      And then, we modify the cpuid calculation. In generic_processor_info(),
      it simply finds the next unused cpuid. And it is also why the cpuid <-> nodeid
      mapping changes with node hotplug.
      
      After this patch, we find the next unused cpuid, map it to an apicid,
      and store the mapping in cpuid_to_apicid[], so that cpuid <-> apicid
      mapping will be persistent.
      
      And finally we will use this array to make cpuid <-> nodeid persistent.
      
      cpuid <-> apicid mapping is established at local apic registration time.
      But non-present or disabled cpus are ignored.
      
      In this patch, we establish all possible cpuid <-> apicid mapping when
      registering local apic.
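The allocation scheme described above can be sketched in userspace C. This is only an illustration under stated assumptions: the array name `cpuid_to_apicid[]` comes from the commit message, while `NR_CPUS_SKETCH`, `nr_logical_cpuids` and `allocate_logical_cpuid_sketch()` are hypothetical names chosen for this sketch, not the kernel's actual code. The point it shows is that an apicid seen before always gets its old cpuid back, so the mapping survives hot-remove and hot-add.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's limit on possible CPUs. */
#define NR_CPUS_SKETCH 8

/* cpuid -> apicid. Entries are never cleared, so the mapping persists. */
static int cpuid_to_apicid[NR_CPUS_SKETCH];

/* Number of cpuids handed out so far; entries below this index are valid. */
static int nr_logical_cpuids;

/*
 * Return the cpuid that belongs to 'apicid'. If the apicid was registered
 * before, the same cpuid is returned again; otherwise the next unused
 * cpuid is assigned and recorded. Returns -1 when the table is full.
 */
static int allocate_logical_cpuid_sketch(int apicid)
{
    int i;

    assert(apicid >= 0);

    /* Reuse the existing mapping if this apicid is already known. */
    for (i = 0; i < nr_logical_cpuids; i++) {
        if (cpuid_to_apicid[i] == apicid)
            return i;
    }

    if (nr_logical_cpuids >= NR_CPUS_SKETCH)
        return -1;

    /* First time we see this apicid: take the next unused cpuid. */
    cpuid_to_apicid[nr_logical_cpuids] = apicid;
    return nr_logical_cpuids++;
}
```

Calling the function twice with the same apicid yields the same cpuid, which is the property the later cpuid <-> nodeid mapping relies on.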
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
      Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: mika.j.penttila@gmail.com
      Cc: len.brown@intel.com
      Cc: rafael@kernel.org
      Cc: rjw@rjwysocki.net
      Cc: yasu.isimatu@gmail.com
      Cc: linux-mm@kvack.org
      Cc: linux-acpi@vger.kernel.org
      Cc: isimatu.yasuaki@jp.fujitsu.com
      Cc: gongzhaogang@inspur.com
      Cc: tj@kernel.org
      Cc: izumi.taku@jp.fujitsu.com
      Cc: cl@linux.com
      Cc: chen.tang@easystack.cn
      Cc: akpm@linux-foundation.org
      Cc: kamezawa.hiroyu@jp.fujitsu.com
      Cc: lenb@kernel.org
      Link: http://lkml.kernel.org/r/1472114120-3281-4-git-send-email-douly.fnst@cn.fujitsu.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      8f54969d
    • G
      x86/acpi: Enable acpi to register all possible cpus at boot time · f7c28833
      Gu Zheng committed
      cpuid <-> nodeid mapping is first established at boot time. And workqueue caches
      the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.
      
      When doing node online/offline, cpuid <-> nodeid mapping is established/destroyed,
      which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
      workqueue does not update wq_numa_possible_cpumask.
      
      So here is the problem:
      
      Assume we have the following cpuid <-> nodeid in the beginning:
      
        Node | CPU
      ------------------------
      node 0 |  0-14, 60-74
      node 1 | 15-29, 75-89
      node 2 | 30-44, 90-104
      node 3 | 45-59, 105-119
      
      and we hot-remove node2 and node3, it becomes:
      
        Node | CPU
      ------------------------
      node 0 |  0-14, 60-74
      node 1 | 15-29, 75-89
      
      and we hot-add node4 and node5, it becomes:
      
        Node | CPU
      ------------------------
      node 0 |  0-14, 60-74
      node 1 | 15-29, 75-89
      node 4 | 30-59
      node 5 | 90-119
      
      But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.
      
      When a pool workqueue is initialized, if its cpumask belongs to a node, its
      pool->node will be mapped to that node. And memory used by this workqueue will
      also be allocated on that node.
      
      static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs){
      ...
              /* if cpumask is contained inside a NUMA node, we belong to that node */
              if (wq_numa_enabled) {
                      for_each_node(node) {
                              if (cpumask_subset(pool->attrs->cpumask,
                                                 wq_numa_possible_cpumask[node])) {
                                      pool->node = node;
                                      break;
                              }
                      }
              }
      
      Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline node,
      which will lead to memory allocation failure:
      
       SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
        cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
        node 0: slabs: 6172, objs: 259224, free: 245741
        node 1: slabs: 3261, objs: 136962, free: 127656
      
      It happens here:
      
      create_worker(struct worker_pool *pool)
       |--> worker = alloc_worker(pool->node);
      
      static struct worker *alloc_worker(int node)
      {
              struct worker *worker;
      
              worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, using the wrong node.
      
              ......
      
              return worker;
      }
      
      [Solution]
      
      There are four mappings in the kernel:
      1. nodeid (logical node id)   <->   pxm
      2. apicid (physical cpu id)   <->   nodeid
      3. cpuid (logical cpu id)     <->   apicid
      4. cpuid (logical cpu id)     <->   nodeid
      
      1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> pxm
         mapping is setup at boot time. This mapping is persistent, won't change.
      
      2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at boot
         time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is also
         persistent.
      
      3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
         allocated, lower ids first, and released at CPU hotremove time, reused for other
         hotadded CPUs. So this mapping is not persistent.
      
      4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
         cleared at CPU hotremove time. As a result of 3, this mapping is not persistent.
      
      To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
      cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
      cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> apicid
      mapping. So the key point is obtaining all cpus' apicid.
      
      apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
      MADT (Multiple APIC Description Table). So we finish the job in the following steps:
      
      1. Enable apic registration flow to handle both enabled and disabled cpus.
         This is done by introducing an extra parameter to generic_processor_info to let the
         caller control if disabled cpus are ignored.
      
      2. Introduce a new array storing all possible cpuid <-> apicid mapping. And also modify
         the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping when
         registering local apic. Store the mapping in this array.
      
      3. Enable _MAT and MADT relative apis to return non-present or disabled cpus' apicid.
         This is also done by introducing an extra parameter to these apis to let the caller
         control if disabled cpus are ignored.
      
      4. Establish all possible cpuid <-> nodeid mapping.
         This is done via an additional acpi namespace walk for processors.
      
      This patch finishes step 1.
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
      Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: mika.j.penttila@gmail.com
      Cc: len.brown@intel.com
      Cc: rafael@kernel.org
      Cc: rjw@rjwysocki.net
      Cc: yasu.isimatu@gmail.com
      Cc: linux-mm@kvack.org
      Cc: linux-acpi@vger.kernel.org
      Cc: isimatu.yasuaki@jp.fujitsu.com
      Cc: gongzhaogang@inspur.com
      Cc: tj@kernel.org
      Cc: izumi.taku@jp.fujitsu.com
      Cc: cl@linux.com
      Cc: chen.tang@easystack.cn
      Cc: akpm@linux-foundation.org
      Cc: kamezawa.hiroyu@jp.fujitsu.com
      Cc: lenb@kernel.org
      Link: http://lkml.kernel.org/r/1472114120-3281-3-git-send-email-douly.fnst@cn.fujitsu.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      f7c28833
    • T
      x86/numa: Online memory-less nodes at boot time · 2532fc31
      Tang Chen committed
      For now, x86 does not support memory-less nodes. A node without memory
      will not be onlined, and the cpus on it will be mapped to the other
      online nodes with memory in init_cpu_to_node(). The reason for doing this
      is to ensure each cpu is mapped to a node with memory, so that it will
      be able to allocate local memory for that cpu.
      
      But we don't have to do it in this way.
      
      In this series of patches, we are going to construct cpu <-> node mapping
      for all possible cpus at boot time, which is a persistent mapping. It means
      that the cpu will be mapped to the node which it belongs to, and will never
      be changed. If a node has only cpus but no memory, the cpus on it will be
      mapped to a memory-less node. And the memory-less node should be onlined.
      
      Allocate pgdats for all memory-less nodes and online them at boot
      time. Then build zonelists for these nodes. As a result, when cpus on these
      memory-less nodes try to allocate memory from local node, it will
      automatically fall back to the proper zones in the zonelists.
      Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
      Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: mika.j.penttila@gmail.com
      Cc: len.brown@intel.com
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: rafael@kernel.org
      Cc: rjw@rjwysocki.net
      Cc: yasu.isimatu@gmail.com
      Cc: linux-mm@kvack.org
      Cc: linux-acpi@vger.kernel.org
      Cc: isimatu.yasuaki@jp.fujitsu.com
      Cc: gongzhaogang@inspur.com
      Cc: tj@kernel.org
      Cc: izumi.taku@jp.fujitsu.com
      Cc: cl@linux.com
      Cc: chen.tang@easystack.cn
      Cc: akpm@linux-foundation.org
      Cc: kamezawa.hiroyu@jp.fujitsu.com
      Cc: lenb@kernel.org
      Link: http://lkml.kernel.org/r/1472114120-3281-2-git-send-email-douly.fnst@cn.fujitsu.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      2532fc31
  6. 21 Sep, 2016 4 commits
  7. 20 Sep, 2016 15 commits
    • M
      x86/efi: Round EFI memmap reservations to EFI_PAGE_SIZE · 92dc3350
      Matt Fleming committed
      Mike Galbraith reported that his machine started rebooting during boot
      after:
      
        commit 8e80632f ("efi/esrt: Use efi_mem_reserve() and avoid a kmalloc()")
      
      The ESRT table on his machine is 56 bytes and at no point in the
      efi_arch_mem_reserve() call path is that size rounded up to
      EFI_PAGE_SIZE, nor is the start address on an EFI_PAGE_SIZE boundary.
      
      Since the EFI memory map only deals with whole pages, inserting an EFI
      memory region with 56 bytes results in a new entry covering zero
      pages, and completely screws up the calculations for the old regions
      that were trimmed.
      
      Round all sizes upwards, and start addresses downwards, to the nearest
      EFI_PAGE_SIZE boundary.
      
      Additionally, efi_memmap_insert() expects the mem::range::end value to
      be one less than the end address for the region.
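The rounding rule described above can be sketched as a small standalone C helper. This is an illustration, not the patch itself: `struct range_sketch`, `efi_round_reservation()` and the `*_SKETCH` macros are hypothetical names, and a 4 KiB EFI page size is assumed. It rounds the start address down, rounds the size up to whole pages, and stores an inclusive end address (one less than the first byte past the region).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: EFI pages are 4 KiB, as on x86. */
#define EFI_PAGE_SHIFT_SKETCH 12
#define EFI_PAGE_SIZE_SKETCH  (1ULL << EFI_PAGE_SHIFT_SKETCH)

/* Hypothetical container for the rounded region. */
struct range_sketch {
    uint64_t start;      /* first byte, page aligned */
    uint64_t end;        /* last byte, inclusive */
    uint64_t num_pages;  /* whole pages covered */
};

static struct range_sketch efi_round_reservation(uint64_t addr, uint64_t size)
{
    const uint64_t mask = EFI_PAGE_SIZE_SKETCH - 1;
    struct range_sketch r;

    assert(size > 0);

    /* Account for the bytes between the page boundary and addr ... */
    size += addr & mask;
    /* ... then round the size up and the start address down. */
    size = (size + mask) & ~mask;
    addr &= ~mask;

    r.start = addr;
    r.end = addr + size - 1;  /* inclusive end, one less than the end address */
    r.num_pages = size >> EFI_PAGE_SHIFT_SKETCH;
    return r;
}
```

With the 56-byte table from the report placed at an unaligned address, the helper returns a region of at least one whole page instead of zero pages.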
      Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Reported-by: Mike Krinkin <krinkin.m.u@gmail.com>
      Tested-by: Mike Krinkin <krinkin.m.u@gmail.com>
      Cc: Peter Jones <pjones@redhat.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      92dc3350
    • S
      perf/x86/intel/bts: Make sure debug store is valid · f1e1c9e5
      Sebastian Andrzej Siewior committed
      Since commit 4d4c4741 ("perf/x86/intel/bts: Fix BTS PMI detection")
      my box goes boom on boot:
      
      | .... node  #0, CPUs:      #1 #2 #3 #4 #5 #6 #7
      | BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      | IP: [<ffffffff8100c463>] intel_bts_interrupt+0x43/0x130
      | Call Trace:
      |  <NMI> d [<ffffffff8100b341>] intel_pmu_handle_irq+0x51/0x4b0
      |  [<ffffffff81004d47>] perf_event_nmi_handler+0x27/0x40
      
      This happens because the code introduced in this commit dereferences the
      debug store pointer unconditionally. The debug store is not guaranteed to
      be available, so a NULL pointer check as in other places is required.
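A minimal userspace sketch of the guard described above follows. The struct layouts and the function name are assumptions made for illustration (`debug_store_sketch`, `cpu_hw_sketch`, `bts_interrupt_sketch()` are hypothetical, and only two fields are modeled); the idea is simply that the handler must bail out when no debug store has been allocated rather than dereference the pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced model of a debug store area. */
struct debug_store_sketch {
    uint64_t bts_index;
    uint64_t bts_interrupt_threshold;
};

/* Hypothetical per-CPU state; 'ds' may be NULL if no debug store exists. */
struct cpu_hw_sketch {
    struct debug_store_sketch *ds;
};

/*
 * Return 1 if the interrupt belongs to BTS (index reached the threshold),
 * 0 otherwise. A missing debug store means "not ours", never a crash.
 */
static int bts_interrupt_sketch(const struct cpu_hw_sketch *cpuc)
{
    const struct debug_store_sketch *ds;

    assert(cpuc != NULL);

    ds = cpuc->ds;
    if (!ds)
        return 0;  /* no debug store: nothing to check */

    return ds->bts_index >= ds->bts_interrupt_threshold;
}
```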
      
      Fixes: 4d4c4741 ("perf/x86/intel/bts: Fix BTS PMI detection")
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: vince@deater.net
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/20160920131220.xg5pbdjtznszuyzb@breakpoint.cc
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      f1e1c9e5
    • M
      x86/efi: Only map RAM into EFI page tables if in mixed-mode · 12976670
      Matt Fleming committed
      Waiman reported that booting with CONFIG_EFI_MIXED enabled on his
      multi-terabyte HP machine results in boot crashes, because the EFI
      region mapping functions loop forever while trying to map those
      regions describing RAM.
      
      While this patch doesn't fix the underlying hang, there's really no
      reason to map EFI_CONVENTIONAL_MEMORY regions into the EFI page tables
      when mixed-mode is not in use at runtime.
      Reported-by: Waiman Long <waiman.long@hpe.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      CC: Theodore Ts'o <tytso@mit.edu>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: <stable@vger.kernel.org> # v4.6+
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      12976670
    • M
      x86/mm/pat: Prevent hang during boot when mapping pages · e535ec08
      Matt Fleming committed
      There's a mixture of signed 32-bit and unsigned 32-bit and 64-bit data
      types used for keeping track of how many pages have been mapped.
      
      This leads to hangs during boot when mapping large numbers of pages
      (multiple terabytes, as reported by Waiman) because those values are
      interpreted as being negative.
      
      commit 74256377 ("x86/mm/pat: Avoid truncation when converting
      cpa->numpages to address") fixed one of those bugs, but there is
      another lurking in __change_page_attr_set_clr().
      
      Additionally, the return value type for the populate_*() functions can
      return negative values when a large number of pages have been mapped,
      triggering the error paths even though no error occurred.
      
      Consistently use 64-bit types on 64-bit platforms when counting pages.
      Even in the signed case this gives us room for regions 8PiB
      (pebibytes) in size whilst still allowing the usual negative value
      error checking idiom.
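The overflow described above can be checked with a few lines of standalone C. This is a sketch under stated assumptions: 4 KiB pages, and the helper names (`bytes_to_pages()`, `fits_in_s32()`, `populate_sketch()`) are hypothetical rather than kernel functions. It shows that a multi-terabyte region has more pages than a signed 32-bit counter can hold, while a signed 64-bit return value keeps the count positive and still leaves negative values free for error codes.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define PAGE_SHIFT_SKETCH 12

/* Number of whole pages in a region of 'bytes' bytes. */
static uint64_t bytes_to_pages(uint64_t bytes)
{
    return bytes >> PAGE_SHIFT_SKETCH;
}

/* Would this page count survive being stored in a signed 32-bit int? */
static int fits_in_s32(uint64_t pages)
{
    return pages <= (uint64_t)INT32_MAX;
}

/*
 * Hypothetical populate-style helper: returns the number of pages mapped
 * (>= 0) or a negative error code. With a 64-bit signed return type a
 * large success value can no longer be mistaken for an error.
 */
static int64_t populate_sketch(uint64_t pages)
{
    assert(pages <= (uint64_t)INT64_MAX);
    return (int64_t)pages;
}
```

For example, a 12 TiB region is 3221225472 pages, which does not fit in a signed 32-bit integer, whereas a 1 TiB region (268435456 pages) still does.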
      Reported-by: Waiman Long <waiman.long@hpe.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      CC: Theodore Ts'o <tytso@mit.edu>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      e535ec08
    • J
      x86/dumpstack: Remove dump_trace() and related callbacks · c8fe4609
      Josh Poimboeuf committed
      All previous users of dump_trace() have been converted to use the new
      unwind interfaces, so we can remove it and the related
      print_context_stack() and print_context_stack_bp() callback functions.
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nilay Vaish <nilayvaish@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/5b97da3572b40b5a4d8e185cf2429308d0987a13.1474045023.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c8fe4609
    • J
      x86/dumpstack: Convert show_trace_log_lvl() to use the new unwinder · e18bcccd
      Josh Poimboeuf committed
      Convert show_trace_log_lvl() to use the new unwinder.  dump_trace() has
      been deprecated.
      
      show_trace_log_lvl() is special compared to other users of the unwinder.
      It's the only place where both reliable *and* unreliable addresses are
      needed.  With frame pointers enabled, most callers of the unwinder don't
      want to know about unreliable addresses.  But in this case, when we're
      dumping the stack to the console because something presumably went
      wrong, the unreliable addresses are useful:
      
      - They show stale data on the stack which can provide useful clues.
      
      - If something goes wrong with the unwinder, or if frame pointers are
        corrupt or missing, all the stack addresses still get shown.
      
      So in order to show all addresses on the stack, and at the same time
      figure out which addresses are reliable, we have to do the scanning and
      the unwinding in parallel.
      
      The scanning is done with the help of get_stack_info() to traverse the
      stacks.  The unwinding is done separately by the new unwinder.
      
      In theory we could simplify show_trace_log_lvl() by instead pushing some
      of this logic into the unwind code.  But then we would need some kind of
      "fake" frame logic in the unwinder which would add a lot of complexity
      and wouldn't be worth it in order to support only one user.
      
      Another benefit of this approach is that once we have a DWARF unwinder,
      we should be able to just plug it in with minimal impact to this code.
      
      Another change here is that callers of show_trace_log_lvl() don't need
      to provide the 'bp' argument.  The unwinder already finds the relevant
      frame pointer by unwinding until it reaches the first frame after the
      provided stack pointer.
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nilay Vaish <nilayvaish@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/703b5998604c712a1f801874b43f35d6dac52ede.1474045023.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e18bcccd
    • J
      oprofile/x86: Convert x86_backtrace() to use the new unwinder · ec2ad9cc
      Josh Poimboeuf committed
      Convert oprofile's x86_backtrace() to use the new unwinder.
      dump_trace() has been deprecated.
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nilay Vaish <nilayvaish@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/412df8927705795e8ea60cffcf89a79e010713b1.1474045023.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ec2ad9cc
    • J
      x86/stacktrace: Convert save_stack_trace_*() to use the new unwinder · 49a612c6
      Josh Poimboeuf committed
      Convert save_stack_trace_*() to use the new unwinder.  dump_trace() has
      been deprecated.
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nilay Vaish <nilayvaish@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/815494c627d89887db0ce56ceffd58ad16ee6c21.1474045023.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      49a612c6
    • J
      perf/x86: Convert perf_callchain_kernel() to use the new unwinder · 35f4d9b3
      Josh Poimboeuf committed
      Convert perf_callchain_kernel() to use the new unwinder.  dump_trace()
      has been deprecated.
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nilay Vaish <nilayvaish@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/a2df0c4f09b3d438e11b41681f10b0775a819a7f.1474045023.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      35f4d9b3
    • J
      x86/unwind: Add new unwind interface and implementations · 7c7900f8
      Josh Poimboeuf committed
      The x86 stack dump code is a bit of a mess.  dump_trace() uses
      callbacks, and each user of it seems to have slightly different
      requirements, so there are several slightly different callbacks floating
      around.
      
      Also there are some upcoming features which will need more changes to
      the stack dump code, including the printing of stack pt_regs, reliable
      stack detection for live patching, and a DWARF unwinder.  Each of those
      features would at least need more callbacks and/or callback interfaces,
      resulting in a much bigger mess than what we have today.
      
      Before doing all that, we should try to clean things up and replace
      dump_trace() with something cleaner and more flexible.
      
      The new unwinder is a simple state machine which was heavily inspired by
      a suggestion from Andy Lutomirski:
      
        https://lkml.kernel.org/r/CALCETrUbNTqaM2LRyXGRx=kVLRPeY5A3Pc6k4TtQxF320rUT=w@mail.gmail.com
      
      It's also similar to the libunwind API:
      
        http://www.nongnu.org/libunwind/man/libunwind(3).html
      
      Some of its advantages:
      
      - Simplicity: no more callback sprawl and less code duplication.
      
      - Flexibility: it allows the caller to stop and inspect the stack state
        at each step in the unwinding process.
      
      - Modularity: the unwinder code, console stack dump code, and stack
        metadata analysis code are all better separated so that changing one
        of them shouldn't have much of an impact on any of the others.
      
      Two implementations are added which conform to the new unwind interface:
      
      - The frame pointer unwinder which is used for CONFIG_FRAME_POINTER=y.
      
      - The "guess" unwinder which is used for CONFIG_FRAME_POINTER=n.  This
        isn't an "unwinder" per se.  All it does is scan the stack for kernel
        text addresses.  But with no frame pointers, guesses are better than
        nothing in most cases.
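The state-machine style of interface described above, together with the "guess" approach of scanning the stack for text addresses, can be sketched in userspace C. This is an illustration only: every name ending in `_sketch` is hypothetical, the "stack" is a plain array of words, and "kernel text" is modeled as a single half-open address range. The caller drives the loop itself and can inspect the state at every step, which is the flexibility the commit message points to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical unwind state: a cursor over stack words plus a text range. */
struct unwind_state_sketch {
    const uintptr_t *sp;    /* current stack word */
    const uintptr_t *end;   /* one past the last stack word */
    uintptr_t text_start;   /* inclusive */
    uintptr_t text_end;     /* exclusive */
};

static int is_text_sketch(const struct unwind_state_sketch *s, uintptr_t v)
{
    return v >= s->text_start && v < s->text_end;
}

/* Advance the cursor until it points at a text address or hits the end. */
static void skip_non_text_sketch(struct unwind_state_sketch *s)
{
    while (s->sp < s->end && !is_text_sketch(s, *s->sp))
        s->sp++;
}

static int unwind_done_sketch(const struct unwind_state_sketch *s)
{
    return s->sp >= s->end;
}

static uintptr_t unwind_get_return_address_sketch(const struct unwind_state_sketch *s)
{
    return unwind_done_sketch(s) ? 0 : *s->sp;
}

static void unwind_start_sketch(struct unwind_state_sketch *s,
                                const uintptr_t *stack, size_t nwords,
                                uintptr_t text_start, uintptr_t text_end)
{
    assert(stack != NULL);
    s->sp = stack;
    s->end = stack + nwords;
    s->text_start = text_start;
    s->text_end = text_end;
    skip_non_text_sketch(s);
}

static void unwind_next_frame_sketch(struct unwind_state_sketch *s)
{
    if (unwind_done_sketch(s))
        return;
    s->sp++;
    skip_non_text_sketch(s);
}

/* Example caller: count the words that look like return addresses. */
static size_t count_frames_sketch(const uintptr_t *stack, size_t nwords,
                                  uintptr_t text_start, uintptr_t text_end)
{
    struct unwind_state_sketch st;
    size_t n = 0;

    for (unwind_start_sketch(&st, stack, nwords, text_start, text_end);
         !unwind_done_sketch(&st);
         unwind_next_frame_sketch(&st)) {
        if (unwind_get_return_address_sketch(&st) != 0)
            n++;
    }
    return n;
}
```

Because the loop lives in the caller, a console dumper, a stack-trace saver and a profiler could each wrap the same start/done/next calls with their own per-frame logic instead of registering callbacks.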
      Suggested-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nilay Vaish <nilayvaish@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/6dc2f909c47533d213d0505f0a113e64585bec82.1474045023.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7c7900f8
    • J
      locking/rwsem, x86: Drop a bogus cc clobber · c907420f
      Jan Beulich committed
      With the addition of uses of GCC's condition code outputs in commit:
      
        35ccfb71 ("x86, asm: Use CC_SET()/CC_OUT() in <asm/rwsem.h>")
      
      ... there's now an overlap of outputs and clobbers in __down_write_trylock().
      
      Such overlaps are generally getting tagged with an error (occasionally
      even with an ICE). I can't really tell why plain GCC 6.2 doesn't detect
      this (judging by the code it is meant to), while the slightly modified
      one I use does. Since condition code clobbers are never necessary on x86
      (other than perhaps for documentation purposes, which doesn't really
      get done consistently), remove it altogether rather than inventing
      something like CC_CLOBBER (to accompany CC_SET/CC_OUT).
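      
      For readers unfamiliar with condition code outputs, the following
      stand-alone sketch shows the pattern in question. It is not the
      rwsem code: try_swap() and its arguments are hypothetical, and the
      asm path is only taken when the compiler advertises flag outputs on
      x86; otherwise a builtin is used instead. The point is that the
      result is read from ZF through an "=@ccz" output and the clobber
      list carries no "cc" entry, so outputs and clobbers do not overlap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: atomically change *word from 'expected' to
 * 'desired' and report whether the exchange happened.
 */
static bool try_swap(long *word, long expected, long desired)
{
	assert(word != NULL);
#if defined(__GCC_ASM_FLAG_OUTPUTS__) && \
    (defined(__x86_64__) || defined(__i386__))
	bool ok;

	/* ZF is an output ("=@ccz"), so "cc" is not listed as a clobber. */
	__asm__ volatile("lock cmpxchg %3, %1"
			 : "=@ccz" (ok), "+m" (*word), "+a" (expected)
			 : "r" (desired)
			 : "memory");
	return ok;
#else
	/* Fallback for compilers or targets without flag outputs. */
	return __atomic_compare_exchange_n(word, &expected, desired, false,
					   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
#endif
}
```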
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/57E003CC0200007800110102@prv-mh.provo.novell.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c907420f
    • A
      perf/x86/intel/pt: Add support for PTWRITE and power event tracing · 8ee83b2a
      Alexander Shishkin committed
      The Intel PT facility grew some new functionality:
      
        * PTWRITE packet carries the payload of the new PTWRITE instruction
          that can be used to instrument Intel PT traces with user-supplied
          data. Packets of this type are only generated if 'ptwrite' capability
          is set and PTWEn bit is set in the event attribute's config. Flow
          update packets (FUP) can be generated on PTWRITE packets if FUPonPTW
          config bit is set. Setting these bits is not allowed if 'ptwrite'
          capability is not set.
      
        * PWRE, PWRX, MWAIT, EXSTOP packets communicate core power management
          events. These depend on 'power_event_tracing' capability and are
          enabled by setting PwrEvtEn bit in the event attribute.
      
      Extend the driver capabilities and provide the proper sanity checks in the
      event validation function.
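      
      The sanity checks described above could look roughly like the
      following. This is a sketch under stated assumptions, not the driver
      code: struct pt_caps, pt_config_valid() and the PT_CFG_* names are
      invented for illustration, and the bit positions are assumptions
      that should be checked against the Intel SDM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only. */
#define PT_CFG_PWR_EVT_EN	(UINT64_C(1) << 4)
#define PT_CFG_FUP_ON_PTW	(UINT64_C(1) << 5)
#define PT_CFG_PTW_EN		(UINT64_C(1) << 12)

/* Hypothetical capability flags, as the hardware would report them. */
struct pt_caps {
	bool ptwrite;
	bool power_event_tracing;
};

/* Reject config bits whose backing capability is not present. */
static bool pt_config_valid(const struct pt_caps *caps, uint64_t config)
{
	assert(caps != NULL);

	if (!caps->ptwrite &&
	    (config & (PT_CFG_PTW_EN | PT_CFG_FUP_ON_PTW)))
		return false;

	if (!caps->power_event_tracing && (config & PT_CFG_PWR_EVT_EN))
		return false;

	return true;
}
```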
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: vince@deater.net
      Cc: eranian@google.com
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Link: http://lkml.kernel.org/r/20160916134819.1978-1-alexander.shishkin@linux.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      8ee83b2a
    • P
      x86: Migrate exception table users off module.h and onto extable.h · 744c193e
      Paul Gortmaker committed
      These files were only including module.h for exception table related
      functions.  We've now separated that content out into its own file
      "extable.h" so now move over to that and avoid all the extra header content
      in module.h that we don't really need to compile these files.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Link: http://lkml.kernel.org/r/20160919210418.30243-1-paul.gortmaker@windriver.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      744c193e
    • D
      x86/apic: Get rid of apic_version[] array · cff9ab2b
      Denys Vlasenko committed
      The array has a size of MAX_LOCAL_APIC, which can be as large as 32k, so it
      can consume up to 128k.
      
      The array has been there forever and was never used for anything useful
      other than a version mismatch check which was introduced in 2009.
      
      There is no reason to store the version in an array. The kernel is not
      prepared to handle different APIC versions anyway, so the real important
      part is to detect a version mismatch and warn about it, which can be done
      with a single variable as well.
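      
      To make the idea concrete: with 32k entries of (assumed) 4 bytes
      each, the array costs 32768 * 4 = 131072 bytes, whereas the check
      only ever needs the boot CPU's version. A minimal sketch of the
      single-variable approach follows; the function and variable names
      are hypothetical and the warning text is only an example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* One variable replaces the per-APIC-ID array. */
static unsigned int boot_apic_version;
static bool boot_apic_version_set;

/* Record the version reported by the boot CPU. */
static void apic_set_boot_version(unsigned int version)
{
	boot_apic_version = version;
	boot_apic_version_set = true;
}

/* Compare a later CPU against the boot CPU; warn and return false on mismatch. */
static bool apic_check_version(int cpu, unsigned int version)
{
	assert(boot_apic_version_set);

	if (version == boot_apic_version)
		return true;

	fprintf(stderr,
		"APIC version mismatch, boot CPU: %x, CPU %d: version %x\n",
		boot_apic_version, cpu, version);
	return false;
}
```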
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      CC: Andy Lutomirski <luto@amacapital.net>
      CC: Borislav Petkov <bp@alien8.de>
      CC: Brian Gerst <brgerst@gmail.com>
      CC: Mike Travis <travis@sgi.com>
      Link: http://lkml.kernel.org/r/20160913181232.30815-1-dvlasenk@redhat.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      cff9ab2b
    • W
      x86/apic: Order irq_enter/exit() calls correctly vs. ack_APIC_irq() · b0f48706
      Wanpeng Li committed
      ===============================
      [ INFO: suspicious RCU usage. ]
      4.8.0-rc6+ #5 Not tainted
      -------------------------------
      ./arch/x86/include/asm/msr-trace.h:47 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      RCU used illegally from idle CPU!
      rcu_scheduler_active = 1, debug_locks = 0
      RCU used illegally from extended quiescent state!
      no locks held by swapper/2/0.
      
      stack backtrace:
      CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.8.0-rc6+ #5
      Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
       0000000000000000 ffff8d1bd6003f10 ffffffff94446949 ffff8d1bd4a68000
       0000000000000001 ffff8d1bd6003f40 ffffffff940e9247 ffff8d1bbdfcf3d0
       000000000000080b 0000000000000000 0000000000000000 ffff8d1bd6003f70
      Call Trace:
       <IRQ>  [<ffffffff94446949>] dump_stack+0x99/0xd0
       [<ffffffff940e9247>] lockdep_rcu_suspicious+0xe7/0x120
       [<ffffffff9448e0d5>] do_trace_write_msr+0x135/0x140
       [<ffffffff9406e750>] native_write_msr+0x20/0x30
       [<ffffffff9406503d>] native_apic_msr_eoi_write+0x1d/0x30
       [<ffffffff9405b17e>] smp_trace_call_function_interrupt+0x1e/0x270
       [<ffffffff948cb1d6>] trace_call_function_interrupt+0x96/0xa0
       <EOI>  [<ffffffff947200f4>] ? cpuidle_enter_state+0xe4/0x360
       [<ffffffff947200df>] ? cpuidle_enter_state+0xcf/0x360
       [<ffffffff947203a7>] cpuidle_enter+0x17/0x20
       [<ffffffff940df008>] cpu_startup_entry+0x338/0x4d0
       [<ffffffff9405bfc4>] start_secondary+0x154/0x180
      
      This can be reproduced readily by running the ftrace test case of kselftest.
      
      Move the irq_enter() call before ack_APIC_irq(), because irq_enter() tells
      the RCU subsystem to end the extended quiescent state, so that the
      following trace call in ack_APIC_irq() works correctly. The same applies to
      exiting_ack_irq() which calls ack_APIC_irq() after irq_exit().
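      
      The ordering problem can be modelled in a few lines of user-space C.
      This is a toy, not kernel code: struct toy_cpu and all toy_*
      functions are invented here, and the model assumes the interrupt
      arrives on an idle CPU, so RCU is not watching until the toy
      irq_enter runs and stops watching again at the toy irq_exit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_cpu {
	bool rcu_watching;	/* false while in the extended quiescent state */
	int rcu_violations;	/* traced calls made while RCU was not watching */
};

static void toy_irq_enter(struct toy_cpu *c)
{
	assert(c != NULL);
	c->rcu_watching = true;
}

static void toy_irq_exit(struct toy_cpu *c)
{
	assert(c != NULL);
	c->rcu_watching = false;
}

/* Stand-in for the EOI write, which is traced and therefore relies on RCU. */
static void toy_ack_irq(struct toy_cpu *c)
{
	assert(c != NULL);
	if (!c->rcu_watching)
		c->rcu_violations++;
}

/* Old order: acknowledge first, then enter IRQ context. */
static void toy_handler_old(struct toy_cpu *c)
{
	toy_ack_irq(c);
	toy_irq_enter(c);
	/* ... handler work ... */
	toy_irq_exit(c);
}

/* New order: enter IRQ context first, so the traced ack is legal. */
static void toy_handler_new(struct toy_cpu *c)
{
	toy_irq_enter(c);
	toy_ack_irq(c);
	/* ... handler work ... */
	toy_irq_exit(c);
}
```

      Running both handlers on an idle toy CPU records one violation for
      the old order and none for the new one.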
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Link: http://lkml.kernel.org/r/1474198491-3738-1-git-send-email-wanpeng.li@hotmail.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      b0f48706
  8. 16 Sep, 2016 2 commits
    • J
      x86/dumpstack: Remove NULL task pointer convention · 81539169
      Josh Poimboeuf committed
      show_stack_log_lvl() and friends allow a NULL pointer for the
      task_struct to indicate the current task.  This creates confusion and
      can cause sneaky bugs.
      
      Instead require the caller to pass 'current' directly.
      
      This only changes the internal workings of the dumpstack code.  The
      dump_trace() and show_stack() interfaces still allow a NULL task
      pointer.  Those interfaces should probably be fixed as well.
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      81539169
    • M
      perf/x86/amd: Make HW_CACHE_REFERENCES and HW_CACHE_MISSES measure L2 · 080fe0b7
      Matt Fleming committed
      While the Intel PMU monitors the LLC when perf enables the
      HW_CACHE_REFERENCES and HW_CACHE_MISSES events, these events monitor
      L1 instruction cache fetches (0x0080) and instruction cache misses
      (0x0081) on the AMD PMU.
      
      This is extremely confusing when monitoring the same workload across
      Intel and AMD machines, since parameters like,
      
        $ perf stat -e cache-references,cache-misses
      
      measure completely different things.
      
      Instead, make the AMD PMU measure instruction/data cache and TLB fill
      requests to the L2 and instruction/data cache and TLB misses in the L2
      when HW_CACHE_REFERENCES and HW_CACHE_MISSES are enabled,
      respectively. That way the events measure unified caches on both
      platforms.
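      
      The change amounts to swapping two entries in an event map. The
      sketch below is illustrative only: the enum, table and function
      names are invented, the old codes are the ones quoted in this
      changelog, and the new codes (0x077d and 0x077e) are recalled from
      the upstream event map rather than taken from this text, so treat
      them as assumptions to verify against the patch itself.

```c
#include <assert.h>
#include <stdint.h>

enum toy_hw_event {
	TOY_CACHE_REFERENCES,
	TOY_CACHE_MISSES,
	TOY_EVENT_MAX
};

/* Old mapping: event codes quoted in the changelog. */
static const uint64_t amd_map_old[TOY_EVENT_MAX] = {
	[TOY_CACHE_REFERENCES]	= 0x0080,	/* L1 instruction cache fetches */
	[TOY_CACHE_MISSES]	= 0x0081,	/* L1 instruction cache misses */
};

/* New mapping: assumed codes for L2 requests and L2 misses. */
static const uint64_t amd_map_new[TOY_EVENT_MAX] = {
	[TOY_CACHE_REFERENCES]	= 0x077d,	/* assumption: fill requests to L2 */
	[TOY_CACHE_MISSES]	= 0x077e,	/* assumption: misses in L2 */
};

/* Translate a generic event into the raw code from the chosen map. */
static uint64_t amd_event_code(const uint64_t *map, enum toy_hw_event ev)
{
	assert((int)ev >= 0 && (int)ev < TOY_EVENT_MAX);
	return map[ev];
}
```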
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1472044328-21302-1-git-send-email-matt@codeblueprint.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      080fe0b7