1. 19 3月, 2014 4 次提交
  2. 14 3月, 2014 3 次提交
  3. 08 3月, 2014 6 次提交
  4. 07 3月, 2014 3 次提交
    • A
      powerpc: Align p_dyn, p_rela and p_st symbols · a5b2cf5b
      Anton Blanchard 提交于
      The 64bit relocation code places a few symbols in the text segment.
      These symbols are only 4 byte aligned where they need to be 8 byte
      aligned. Add an explicit alignment.
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Cc: stable@vger.kernel.org
      Tested-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      a5b2cf5b
    • M
      powerpc/tm: Fix crash when forking inside a transaction · 621b5060
      Michael Neuling 提交于
      When we fork/clone we currently don't copy any of the TM state to the new
      thread.  This results in a TM bad thing (program check) when the new process is
      switched in as the kernel does a tmrechkpt with TEXASR FS not set.  Also, since
      R1 is from userspace, we trigger the bad kernel stack pointer detection.  So we
      end up with something like this:
      
         Bad kernel stack pointer 0 at c0000000000404fc
         cpu 0x2: Vector: 700 (Program Check) at [c00000003ffefd40]
             pc: c0000000000404fc: restore_gprs+0xc0/0x148
             lr: 0000000000000000
             sp: 0
            msr: 9000000100201030
           current = 0xc000001dd1417c30
           paca    = 0xc00000000fe00800   softe: 0        irq_happened: 0x01
             pid   = 0, comm = swapper/2
         WARNING: exception is not recoverable, can't continue
      
      The below fixes this by flushing the TM state before we copy the task_struct to
      the clone.  To do this we go through the tmreclaim patch, which removes the
      checkpointed registers from the CPU and transitions the CPU out of TM suspend
      mode.  Hence we need to call tmrechkpt after to restore the checkpointed state
      and the TM mode for the current task.
      
      To make this fail from userspace is simply:
      	tbegin
      	li	r0, 2
      	sc
      	<boom>
      
      Kudos to Adhemerval Zanella Neto for finding this.
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      cc: Adhemerval Zanella Neto <azanella@br.ibm.com>
      cc: stable@vger.kernel.org
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      621b5060
    • P
      x86, trace: Further robustify CR2 handling vs tracing · d4078e23
      Peter Zijlstra 提交于
      Building on commit 0ac09f9f ("x86, trace: Fix CR2 corruption when
      tracing page faults") this patch addresses another few issues:
      
       - Now that read_cr2() is lifted into trace_do_page_fault(), we should
         pass the address to trace_page_fault_entries() to avoid it
         re-reading a potentially changed cr2.
      
       - Put both trace_do_page_fault() and trace_page_fault_entries() under
         CONFIG_TRACING.
      
       - Mark both fault entry functions {,trace_}do_page_fault() as notrace
         to avoid getting __mcount or other function entry trace callbacks
         before we've observed CR2.
      
       - Mark __do_page_fault() as noinline to guarantee the function tracer
         does get to see the fault.
      
      Cc: <jolsa@redhat.com>
      Cc: <vincent.weaver@maine.edu>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20140306145300.GO9987@twins.programming.kicks-ass.netSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      d4078e23
  5. 05 3月, 2014 3 次提交
    • J
      x86, trace: Fix CR2 corruption when tracing page faults · 0ac09f9f
      Jiri Olsa 提交于
      The trace_do_page_fault function trigger tracepoint
      and then handles the actual page fault.
      
      This could lead to error if the tracepoint caused page
      fault. The original cr2 value gets lost and the original
      page fault handler kills current process with SIGSEGV.
      
      This happens if you record page faults with callchain
      data, the user part of it will cause tracepoint handler
      to page fault:
      
        # perf record -g -e exceptions:page_fault_user ls
      
      Fixing this by saving the original cr2 value
      and using it after tracepoint handler is done.
      
      v2: Moving the cr2 read before exception_enter, because
          it could trigger tracepoint as well.
      Reported-by: NArnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Reported-by: NVince Weaver <vincent.weaver@maine.edu>
      Tested-by: NVince Weaver <vincent.weaver@maine.edu>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Seiji Aguchi <seiji.aguchi@hds.com>
      Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1402211701380.6395@vincent-weaver-1.um.maine.edu
      Link: http://lkml.kernel.org/r/20140228160526.GD1133@krava.brq.redhat.com
      0ac09f9f
    • B
      x86/efi: Quirk out SGI UV · a5d90c92
      Borislav Petkov 提交于
      Alex reported hitting the following BUG after the EFI 1:1 virtual
      mapping work was merged,
      
       kernel BUG at arch/x86/mm/init_64.c:351!
       invalid opcode: 0000 [#1] SMP
       Call Trace:
        [<ffffffff818aa71d>] init_extra_mapping_uc+0x13/0x15
        [<ffffffff818a5e20>] uv_system_init+0x22b/0x124b
        [<ffffffff8108b886>] ? clockevents_register_device+0x138/0x13d
        [<ffffffff81028dbb>] ? setup_APIC_timer+0xc5/0xc7
        [<ffffffff8108b620>] ? clockevent_delta2ns+0xb/0xd
        [<ffffffff818a3a92>] ? setup_boot_APIC_clock+0x4a8/0x4b7
        [<ffffffff8153d955>] ? printk+0x72/0x74
        [<ffffffff818a1757>] native_smp_prepare_cpus+0x389/0x3d6
        [<ffffffff818957bc>] kernel_init_freeable+0xb7/0x1fb
        [<ffffffff81535530>] ? rest_init+0x74/0x74
        [<ffffffff81535539>] kernel_init+0x9/0xff
        [<ffffffff81541dfc>] ret_from_fork+0x7c/0xb0
        [<ffffffff81535530>] ? rest_init+0x74/0x74
      
      Getting this thing to work with the new mapping scheme would need more
      work, so automatically switch to the old memmap layout for SGI UV.
      Acked-by: NRuss Anderson <rja@sgi.com>
      Cc: Alex Thorlton <athorlton@sgi.com
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      a5d90c92
    • M
      c6x: fix build failure caused by cache.h · ae72758f
      Mark Salter 提交于
      A patch to linux/irqflags.h uncovered a problem with c6x asm/cache.h
      which causes a build failure:
      
      /arch/c6x/include/asm/cache.h:63:20: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘c6x_cache_init’
       extern void __init c6x_cache_init(void);
      
      The asm/cache.h was relying on linux/irqflags.h to pull in linux/init.h
      but the recent patch changed that. The c6x header should have included
      linux/init.h all along.
      Signed-off-by: NMark Salter <msalter@redhat.com>
      ae72758f
  6. 04 3月, 2014 1 次提交
  7. 03 3月, 2014 2 次提交
    • U
      ARM: XEN depends on having a MMU · 7693decc
      Uwe Kleine-König 提交于
      arch/arm/xen/enlighten.c (and maybe others) use MMU-specific functions
      like pte_mkspecial which are only available on MMU builds. So let XEN
      depend on MMU.
      Suggested-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NUwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Acked-by: NStefano Stabellini <stefano.stabellini@eu.citrix.com>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      7693decc
    • M
      ARM: dts: omap3-gta04: Add ti,omap36xx to compatible property to avoid problems with booting · ae41a303
      Marek Belisko 提交于
      Without that change booting leads to crash with more warnings like below:
      [    0.284454] omap_hwmod: uart4: cannot clk_get main_clk uart4_fck
      [    0.284484] omap_hwmod: uart4: cannot _init_clocks
      [    0.284484] ------------[ cut here ]------------
      [    0.284545] WARNING: CPU: 0 PID: 1 at arch/arm/mach-omap2/omap_hwmod.c:2543 _init+0x300/0x3e4()
      [    0.284545] omap_hwmod: uart4: couldn't init clocks
      [    0.284576] Modules linked in:
      [    0.284606] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-next-20140124-00020-gd2aefec-dirty #26
      [    0.284637] [<c00151c0>] (unwind_backtrace) from [<c0011e20>] (show_stack+0x10/0x14)
      [    0.284667] [<c0011e20>] (show_stack) from [<c0568544>] (dump_stack+0x7c/0x94)
      [    0.284729] [<c0568544>] (dump_stack) from [<c003ff94>] (warn_slowpath_common+0x6c/0x90)
      [    0.284729] [<c003ff94>] (warn_slowpath_common) from [<c003ffe8>] (warn_slowpath_fmt+0x30/0x40)
      [    0.284759] [<c003ffe8>] (warn_slowpath_fmt) from [<c07d1be8>] (_init+0x300/0x3e4)
      [    0.284790] [<c07d1be8>] (_init) from [<c07d217c>] (__omap_hwmod_setup_all+0x40/0x8c)
      [    0.284820] [<c07d217c>] (__omap_hwmod_setup_all) from [<c0008918>] (do_one_initcall+0xe8/0x14c)
      [    0.284851] [<c0008918>] (do_one_initcall) from [<c07c5c18>] (kernel_init_freeable+0x104/0x1c8)
      [    0.284881] [<c07c5c18>] (kernel_init_freeable) from [<c0563524>] (kernel_init+0x8/0x118)
      [    0.284912] [<c0563524>] (kernel_init) from [<c000e368>] (ret_from_fork+0x14/0x2c)
      [    0.285064] ---[ end trace 63de210ad43b627d ]---
      
      Reference:
      https://lkml.org/lkml/2013/10/8/553Signed-off-by: NMarek Belisko <marek@goldelico.com>
      Signed-off-by: NTony Lindgren <tony@atomide.com>
      ae41a303
  8. 01 3月, 2014 3 次提交
  9. 28 2月, 2014 13 次提交
    • S
      arm64: mm: Add double logical invert to pte accessors · 84fe6826
      Steve Capper 提交于
      Page table entries on ARM64 are 64 bits, and some pte functions such as
      pte_dirty return a bitwise-and of a flag with the pte value. If the
      flag to be tested resides in the upper 32 bits of the pte, then we run
      into the danger of the result being dropped if downcast.
      
      For example:
      	gather_stats(page, md, pte_dirty(*pte), 1);
      where pte_dirty(*pte) is downcast to an int.
      
      This patch adds a double logical invert to all the pte_ accessors to
      ensure predictable downcasting.
      Signed-off-by: NSteve Capper <steve.capper@linaro.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      84fe6826
    • B
      powerpc/powernv: Fix indirect XSCOM unmangling · e0cf9576
      Benjamin Herrenschmidt 提交于
      We need to unmangle the full address, not just the register
      number, and we also need to support the real indirect bit
      being set for in-kernel uses.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: <stable@vger.kernel.org> [v3.13]
      e0cf9576
    • B
      powerpc/powernv: Fix opal_xscom_{read,write} prototype · 2f3f38e4
      Benjamin Herrenschmidt 提交于
      The OPAL firmware functions opal_xscom_read and opal_xscom_write
      take a 64-bit argument for the XSCOM (PCB) address in order to
      support the indirect mode on P8.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: <stable@vger.kernel.org> [v3.13]
      2f3f38e4
    • G
      powerpc/powernv: Refactor PHB diag-data dump · af87d2fe
      Gavin Shan 提交于
      As Ben suggested, the patch prints PHB diag-data with multiple
      fields in one line and omits the line if the fields of that
      line are all zero.
      
      With the patch applied, the PHB3 diag-data dump looks like:
      
      PHB3 PHB#3 Diag-data (Version: 1)
      
        brdgCtl:     00000002
        RootSts:     0000000f 00400000 b0830008 00100147 00002000
        nFir:        0000000000000000 0030006e00000000 0000000000000000
        PhbSts:      0000001c00000000 0000000000000000
        Lem:         0000000000100000 42498e327f502eae 0000000000000000
        InAErr:      8000000000000000 8000000000000000 0402030000000000 0000000000000000
        PE[  8] A/B: 8480002b00000000 8000000000000000
      
      [ The current diag data is so big that it overflows the printk
        buffer pretty quickly in cases when we get a handful of errors
        at once which can happen. --BenH
      ]
      Signed-off-by: NGavin Shan <shangw@linux.vnet.ibm.com>
      CC: <stable@vger.kernel.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      af87d2fe
    • G
      powerpc/powernv: Dump PHB diag-data immediately · 94716604
      Gavin Shan 提交于
      The PHB diag-data is important to help locating the root cause for
      EEH errors such as frozen PE or fenced PHB. However, the EEH core
      enables IO path by clearing part of HW registers before collecting
      this data causing it to be corrupted.
      
      This patch fixes this by dumping the PHB diag-data immediately when
      frozen/fenced state on PE or PHB is detected for the first time in
      eeh_ops::get_state() or next_error() backend.
      Signed-off-by: NGavin Shan <shangw@linux.vnet.ibm.com>
      CC: <stable@vger.kernel.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      94716604
    • P
      powerpc: Increase stack redzone for 64-bit userspace to 512 bytes · 573ebfa6
      Paul Mackerras 提交于
      The new ELFv2 little-endian ABI increases the stack redzone -- the
      area below the stack pointer that can be used for storing data --
      from 288 bytes to 512 bytes.  This means that we need to allow more
      space on the user stack when delivering a signal to a 64-bit process.
      
      To make the code a bit clearer, we define new USER_REDZONE_SIZE and
      KERNEL_REDZONE_SIZE symbols in ptrace.h.  For now, we leave the
      kernel redzone size at 288 bytes, since increasing it to 512 bytes
      would increase the size of interrupt stack frames correspondingly.
      
      Gcc currently only makes use of 288 bytes of redzone even when
      compiling for the new little-endian ABI, and the kernel cannot
      currently be compiled with the new ABI anyway.
      
      In the future, hopefully gcc will provide an option to control the
      amount of redzone used, and then we could reduce it even more.
      
      This also changes the code in arch_compat_alloc_user_space() to
      preserve the expanded redzone.  It is not clear why this function would
      ever be used on a 64-bit process, though.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      CC: <stable@vger.kernel.org> [v3.13]
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      573ebfa6
    • L
      powerpc/ftrace: bugfix for test_24bit_addr · a95fc585
      Liu Ping Fan 提交于
      The branch target should be the func addr, not the addr of func_descr_t.
      So using ppc_function_entry() to generate the right target addr.
      Signed-off-by: NLiu Ping Fan <pingfank@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      a95fc585
    • L
      powerpc/crashdump : Fix page frame number check in copy_oldmem_page · f5295bd8
      Laurent Dufour 提交于
      In copy_oldmem_page, the current check using max_pfn and min_low_pfn to
      decide if the page is backed or not, is not valid when the memory layout is
      not continuous.
      
      This happens when running as a QEMU/KVM guest, where RTAS is mapped higher
      in the memory. In that case max_pfn points to the end of RTAS, and a hole
      between the end of the kdump kernel and RTAS is not backed by PTEs. As a
      consequence, the kdump kernel is crashing in copy_oldmem_page when accessing
      in a direct way the pages in that hole.
      
      This fix relies on the memblock's service memblock_is_region_memory to
      check if the read page is part or not of the directly accessible memory.
      Signed-off-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Tested-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      CC: <stable@vger.kernel.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      f5295bd8
    • T
      powerpc/le: Ensure that the 'stop-self' RTAS token is handled correctly · 41dd03a9
      Tony Breeds 提交于
      Currently we're storing a host endian RTAS token in
      rtas_stop_self_args.token.  We then pass that directly to rtas.  This is
      fine on big endian however on little endian the token is not what we
      expect.
      
      This will typically result in hitting:
      	panic("Alas, I survived.\n");
      
      To fix this we always use the stop-self token in host order and always
      convert it to be32 before passing this to rtas.
      Signed-off-by: NTony Breeds <tony@bakeyournoodle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      41dd03a9
    • T
      ARM: OMAP3: Fix pinctrl interrupts for core2 · 4b416368
      Tony Lindgren 提交于
      After splitting padconf core into two parts to avoid exposing
      unaccessable registers, the new padconf core2 domain was left
      without a wake-up interrupt.
      
      Fix the issue by passing the shared wake-up interrupt in
      platform data like we do for padconf core and wkup domains
      already.
      
      Fixes: 3d495383 (ARM: dts: Split omap3 pinmux core device)
      Signed-off-by: NTony Lindgren <tony@atomide.com>
      4b416368
    • P
      kvm, vmx: Really fix lazy FPU on nested guest · 1b385cbd
      Paolo Bonzini 提交于
      Commit e504c909 (kvm, vmx: Fix lazy FPU on nested guest, 2013-11-13)
      highlighted a real problem, but the fix was subtly wrong.
      
      nested_read_cr0 is the CR0 as read by L2, but here we want to look at
      the CR0 value reflecting L1's setup.  In other words, L2 might think
      that TS=0 (so nested_read_cr0 has the bit clear); but if L1 is actually
      running it with TS=1, we should inject the fault into L1.
      
      The effective value of CR0 in L2 is contained in vmcs12->guest_cr0, use
      it.
      
      Fixes: e504c909Reported-by: NKashyap Chamarty <kchamart@redhat.com>
      Reported-by: NStefan Bader <stefan.bader@canonical.com>
      Tested-by: NKashyap Chamarty <kchamart@redhat.com>
      Tested-by: NAnthoine Bourgeois <bourgeois@bertin.fr>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      1b385cbd
    • A
      kvm: x86: fix emulator buffer overflow (CVE-2014-0049) · a08d3b3b
      Andrew Honig 提交于
      The problem occurs when the guest performs a pusha with the stack
      address pointing to an mmio address (or an invalid guest physical
      address) to start with, but then extending into an ordinary guest
      physical address.  When doing repeated emulated pushes
      emulator_read_write sets mmio_needed to 1 on the first one.  On a
      later push when the stack points to regular memory,
      mmio_nr_fragments is set to 0, but mmio_is_needed is not set to 0.
      
      As a result, KVM exits to userspace, and then returns to
      complete_emulated_mmio.  In complete_emulated_mmio
      vcpu->mmio_cur_fragment is incremented.  The termination condition of
      vcpu->mmio_cur_fragment == vcpu->mmio_nr_fragments is never achieved.
      The code bounces back and fourth to userspace incrementing
      mmio_cur_fragment past it's buffer.  If the guest does nothing else it
      eventually leads to a a crash on a memcpy from invalid memory address.
      
      However if a guest code can cause the vm to be destroyed in another
      vcpu with excellent timing, then kvm_clear_async_pf_completion_queue
      can be used by the guest to control the data that's pointed to by the
      call to cancel_work_item, which can be used to gain execution.
      
      Fixes: f78146b0Signed-off-by: NAndrew Honig <ahonig@google.com>
      Cc: stable@vger.kernel.org (3.5+)
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a08d3b3b
    • M
      arm/arm64: KVM: detect CPU reset on CPU_PM_EXIT · b20c9f29
      Marc Zyngier 提交于
      Commit 1fcf7ce0 (arm: kvm: implement CPU PM notifier) added
      support for CPU power-management, using a cpu_notifier to re-init
      KVM on a CPU that entered CPU idle.
      
      The code assumed that a CPU entering idle would actually be powered
      off, loosing its state entierely, and would then need to be
      reinitialized. It turns out that this is not always the case, and
      some HW performs CPU PM without actually killing the core. In this
      case, we try to reinitialize KVM while it is still live. It ends up
      badly, as reported by Andre Przywara (using a Calxeda Midway):
      
      [    3.663897] Kernel panic - not syncing: unexpected prefetch abort in Hyp mode at: 0x685760
      [    3.663897] unexpected data abort in Hyp mode at: 0xc067d150
      [    3.663897] unexpected HVC/SVC trap in Hyp mode at: 0xc0901dd0
      
      The trick here is to detect if we've been through a full re-init or
      not by looking at HVBAR (VBAR_EL2 on arm64). This involves
      implementing the backend for __hyp_get_vectors in the main KVM HYP
      code (rather small), and checking the return value against the
      default one when the CPU notifier is called on CPU_PM_EXIT.
      Reported-by: NAndre Przywara <osp@andrep.de>
      Tested-by: NAndre Przywara <osp@andrep.de>
      Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Cc: Rob Herring <rob.herring@linaro.org>
      Acked-by: NChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b20c9f29
  10. 27 2月, 2014 2 次提交
    • P
      perf/x86: Fix event scheduling · 26e61e89
      Peter Zijlstra 提交于
      Vince "Super Tester" Weaver reported a new round of syscall fuzzing (Trinity) failures,
      with perf WARN_ON()s triggering. He also provided traces of the failures.
      
      This is I think the relevant bit:
      
      	>    pec_1076_warn-2804  [000] d...   147.926153: x86_pmu_disable: x86_pmu_disable
      	>    pec_1076_warn-2804  [000] d...   147.926153: x86_pmu_state: Events: {
      	>    pec_1076_warn-2804  [000] d...   147.926156: x86_pmu_state:   0: state: .R config: ffffffffffffffff (          (null))
      	>    pec_1076_warn-2804  [000] d...   147.926158: x86_pmu_state:   33: state: AR config: 0 (ffff88011ac99800)
      	>    pec_1076_warn-2804  [000] d...   147.926159: x86_pmu_state: }
      	>    pec_1076_warn-2804  [000] d...   147.926160: x86_pmu_state: n_events: 1, n_added: 0, n_txn: 1
      	>    pec_1076_warn-2804  [000] d...   147.926161: x86_pmu_state: Assignment: {
      	>    pec_1076_warn-2804  [000] d...   147.926162: x86_pmu_state:   0->33 tag: 1 config: 0 (ffff88011ac99800)
      	>    pec_1076_warn-2804  [000] d...   147.926163: x86_pmu_state: }
      	>    pec_1076_warn-2804  [000] d...   147.926166: collect_events: Adding event: 1 (ffff880119ec8800)
      
      So we add the insn:p event (fd[23]).
      
      At this point we should have:
      
        n_events = 2, n_added = 1, n_txn = 1
      
      	>    pec_1076_warn-2804  [000] d...   147.926170: collect_events: Adding event: 0 (ffff8800c9e01800)
      	>    pec_1076_warn-2804  [000] d...   147.926172: collect_events: Adding event: 4 (ffff8800cbab2c00)
      
      We try and add the {BP,cycles,br_insn} group (fd[3], fd[4], fd[15]).
      These events are 0:cycles and 4:br_insn, the BP event isn't x86_pmu so
      that's not visible.
      
      	group_sched_in()
      	  pmu->start_txn() /* nop - BP pmu */
      	  event_sched_in()
      	     event->pmu->add()
      
      So here we should end up with:
      
        0: n_events = 3, n_added = 2, n_txn = 2
        4: n_events = 4, n_added = 3, n_txn = 3
      
      But seeing the below state on x86_pmu_enable(), the must have failed,
      because the 0 and 4 events aren't there anymore.
      
      Looking at group_sched_in(), since the BP is the leader, its
      event_sched_in() must have succeeded, for otherwise we would not have
      seen the sibling adds.
      
      But since neither 0 or 4 are in the below state; their event_sched_in()
      must have failed; but I don't see why, the complete state: 0,0,1:p,4
      fits perfectly fine on a core2.
      
      However, since we try and schedule 4 it means the 0 event must have
      succeeded!  Therefore the 4 event must have failed, its failure will
      have put group_sched_in() into the fail path, which will call:
      
      	event_sched_out()
      	  event->pmu->del()
      
      on 0 and the BP event.
      
      Now x86_pmu_del() will reduce n_events; but it will not reduce n_added;
      giving what we see below:
      
       n_event = 2, n_added = 2, n_txn = 2
      
      	>    pec_1076_warn-2804  [000] d...   147.926177: x86_pmu_enable: x86_pmu_enable
      	>    pec_1076_warn-2804  [000] d...   147.926177: x86_pmu_state: Events: {
      	>    pec_1076_warn-2804  [000] d...   147.926179: x86_pmu_state:   0: state: .R config: ffffffffffffffff (          (null))
      	>    pec_1076_warn-2804  [000] d...   147.926181: x86_pmu_state:   33: state: AR config: 0 (ffff88011ac99800)
      	>    pec_1076_warn-2804  [000] d...   147.926182: x86_pmu_state: }
      	>    pec_1076_warn-2804  [000] d...   147.926184: x86_pmu_state: n_events: 2, n_added: 2, n_txn: 2
      	>    pec_1076_warn-2804  [000] d...   147.926184: x86_pmu_state: Assignment: {
      	>    pec_1076_warn-2804  [000] d...   147.926186: x86_pmu_state:   0->33 tag: 1 config: 0 (ffff88011ac99800)
      	>    pec_1076_warn-2804  [000] d...   147.926188: x86_pmu_state:   1->0 tag: 1 config: 1 (ffff880119ec8800)
      	>    pec_1076_warn-2804  [000] d...   147.926188: x86_pmu_state: }
      	>    pec_1076_warn-2804  [000] d...   147.926190: x86_pmu_enable: S0: hwc->idx: 33, hwc->last_cpu: 0, hwc->last_tag: 1 hwc->state: 0
      
      So the problem is that x86_pmu_del(), when called from a
      group_sched_in() that fails (for whatever reason), and without x86_pmu
      TXN support (because the leader is !x86_pmu), will corrupt the n_added
      state.
      Reported-and-Tested-by: NVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20140221150312.GF3104@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      26e61e89
    • M
      KVM: MMU: drop read-only large sptes when creating lower level sptes · 404381c5
      Marcelo Tosatti 提交于
      Read-only large sptes can be created due to read-only faults as
      follows:
      
      - QEMU pagetable entry that maps guest memory is read-only
      due to COW.
      - Guest read faults such memory, COW is not broken, because
      it is a read-only fault.
      - Enable dirty logging, large spte not nuked because it is read-only.
      - Write-fault on such memory causes guest to loop endlessly
      (which must go down to level 1 because dirty logging is enabled).
      
      Fix by dropping large spte when necessary.
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      404381c5