1. 02 April 2015, 16 commits
  2. 27 March 2015, 6 commits
    • perf: Add per event clockid support · 34f43927
      Committed by Peter Zijlstra
      While thinking on the whole clock discussion it occurred to me we have
      two distinct uses of time:
      
       1) the tracking of event/ctx/cgroup enabled/running/stopped times
          which includes the self-monitoring support in struct
          perf_event_mmap_page.
      
       2) the actual timestamps visible in the data records.
      
      And we've been conflating them.
      
      The first is all about tracking time deltas; nobody should really care
      in what time base that happens, it's all relative information, and as
      long as it's internally consistent it works.
      
      The second however is what people are worried about when having to
      merge their data with external sources. And here we have the
      discussion on MONOTONIC vs MONOTONIC_RAW etc..
      
      Where MONOTONIC is good for correlating between machines (static
      offset), MONOTONIC_RAW is required for correlating against a
      fixed-rate hardware clock.
      
      This means configurability; now 1) makes that hard because it needs to
      be internally consistent across groups of unrelated events, which is
      why we had to have a global perf_clock().
      
      However, for 2) it doesn't really matter, perf itself doesn't care
      what it writes into the buffer.
      
      The below patch makes the distinction between these two cases by
      adding perf_event_clock() which is used for the second case. It
      further makes this configurable on a per-event basis, but adds a few
      sanity checks such that we cannot combine events with different clocks
      in confusing ways.
      
      And since we then have per-event configurability we might as well
      retain the 'legacy' behaviour as a default.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      34f43927
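      [ Editorial illustration — a minimal user-space sketch of the
        interface this patch introduces (perf_event_attr::use_clockid and
        ::clockid); assumes v4.1+ kernel headers: ]

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <time.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/perf_event.h>

        int main(void)
        {
                struct perf_event_attr attr;

                memset(&attr, 0, sizeof(attr));
                attr.size = sizeof(attr);
                attr.type = PERF_TYPE_HARDWARE;
                attr.config = PERF_COUNT_HW_INSTRUCTIONS;
                attr.use_clockid = 1;                   /* opt in to a per-event clock */
                attr.clockid = CLOCK_MONOTONIC_RAW;     /* clock used for record timestamps */

                /* measure this task on any CPU */
                int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
                if (fd < 0) {
                        perror("perf_event_open");
                        return 1;
                }
                close(fd);
                return 0;
        }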
    • perf/x86: Remove redundant calls to perf_pmu_{dis|en}able() · 9332d250
      Committed by David Ahern
      perf_pmu_disable() is called before pmu->add() and perf_pmu_enable() is
      called afterwards. There is no need to call these inside x86_pmu_add()
      as well.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1424281543-67335-1-git-send-email-dsahern@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9332d250
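      [ Editorial sketch — a simplified view of the generic scheduling path
        in kernel/events/core.c: the core already brackets ->add() with
        perf_pmu_disable()/perf_pmu_enable(), which is why x86_pmu_add()
        need not repeat it: ]

        perf_pmu_disable(event->pmu);                   /* PMU stopped by the core */
        if (event->pmu->add(event, PERF_EF_START))      /* ->add() runs with PMU disabled */
                ret = -EAGAIN;
        perf_pmu_enable(event->pmu);                    /* PMU restarted by the core */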
    • time: Rename timekeeper::tkr to timekeeper::tkr_mono · 876e7881
      Committed by Peter Zijlstra
      In preparation of adding another tkr field, rename this one to
      tkr_mono. Also rename tk_read_base::base_mono to tk_read_base::base,
      since the structure is not specific to CLOCK_MONOTONIC and the mono
      name got added to the tk_read_base instance.
      
      Lots of trivial churn.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: John Stultz <john.stultz@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150319093400.344679419@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      876e7881
    • perf/x86/intel: Add INST_RETIRED.ALL workarounds · 294fe0f5
      Committed by Andi Kleen
      On Broadwell INST_RETIRED.ALL cannot be used with any period
      that doesn't have the lowest 6 bits cleared. And the period
      should not be smaller than 128.
      
      This is erratum BDM11 and BDM55:
      
        http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/5th-gen-core-family-spec-update.pdf
      
      BDM11: When using a period < 100, we may get incorrect PEBS/PMI
      interrupts and/or an invalid counter state.
      BDM55: When bits 0-5 of the period are not 0, we may get redundant PEBS
      records on overflow.
      
      Add a new callback to enforce this, and set it for Broadwell.
      
      How does this handle the case when an app requests a specific
      period with some of the bottom bits set?
      
      Short answer:
      
      Any useful instruction sampling period needs to be 4-6 orders
      of magnitude larger than 128, as a PMI every 128 instructions
      would instantly overwhelm the system and be throttled.
      So the +-64 error from this is really small compared to the
      period, much smaller than normal system jitter.
      
      Long answer (by Peterz):
      
      IFF we guarantee perf_event_attr::sample_period >= 128.
      
      Suppose we start out with sample_period=192; then we'll set period_left
      to 192, we'll end up with left = 128 (we truncate the lower bits). We
      get an interrupt, find that period_left = 64 (>0 so we return 0 and
      don't get an overflow handler), up that to 128. Then we trigger again,
      at n=256. Then we find period_left = -64 (<=0 so we return 1 and do get
      an overflow). We increment with sample_period so we get left = 128. We
      fire again, at n=384, period_left = 0 (<=0 so we return 1 and get an
      overflow). And on and on.
      
      So while the individual interrupts are 'wrong', we get them with
      intervals of 256,128 in exactly the right ratio to average out at 192.
      And this works for everything >= 128.
      
      So the num_samples*fixed_period thing is still entirely correct +- 127,
      which is good enough I'd say, as you already have that error anyhow.
      
      So there is no need to 'fix' the tools; all we need to do is refuse to
      create INST_RETIRED:ALL events with sample_period < 128.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      [ Updated comments and changelog a bit. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1424225886-18652-3-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      294fe0f5
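      [ Editorial illustration — a standalone C program (not from the
        patch) reproducing the walkthrough's arithmetic; truncation to
        128-instruction granularity is assumed here, matching the
        192 -> 128 step described above: ]

        #include <stdio.h>

        int main(void)
        {
                const long sample_period = 192;
                long period_left = sample_period;
                long n = 0, last = 0;

                for (int overflows = 0; overflows < 6; ) {
                        long left = period_left;
                        if (left < 128)         /* hardware floor */
                                left = 128;
                        left &= ~127L;          /* truncate the low bits */
                        n += left;              /* PMI fires after 'left' instructions */
                        period_left -= left;
                        if (period_left <= 0) { /* a real overflow: rearm */
                                printf("overflow at n=%ld, interval=%ld\n",
                                       n, n - last);
                                last = n;
                                period_left += sample_period;
                                overflows++;
                        }
                }
                return 0;       /* intervals alternate 256,128 -> average 192 */
        }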
    • perf/x86/intel: Add Broadwell core support · 91f1b705
      Committed by Andi Kleen
      Add Broadwell core support to perf.
      
      The basic support is very similar to Haswell. We use the new cache
      event list added for Haswell earlier. The only differences
      are a few bits related to remote nodes. To avoid an extra, mostly
      identical table, these are patched up in the initialization code.
      
      The constraint list has one new event, relative to Haswell, that needs to be handled.
      
      Includes code and testing from Kan Liang.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1424225886-18652-2-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      91f1b705
    • perf/x86/intel: Add new cache events table for Haswell · 0f1b5ca2
      Committed by Andi Kleen
      Haswell offcore events are quite different from Sandy Bridge.
      Add a new table to handle Haswell properly.
      
      Note that the offcore bits listed in the SDM are not quite correct
      (this is currently being fixed). An up-to-date list of bits is
      in the patch.
      
      The basic setup is similar to Sandy Bridge. The prefetch columns
      have been removed, as prefetch counting is not very reliable
      on Haswell. One L1 event that is no longer in the event list has
      also been removed.
      
      - data reads do not include code reads (comparable to earlier Sandy Bridge tables)
      - data counts include speculative execution (except L1 write, dtlb, bpu)
      - remote node access includes remote memory, remote cache, and remote MMIO.
      - prefetches are not included in the counts for consistency
        (different from Sandy Bridge, which includes prefetches in the remote node)
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      [ Removed the HSM30 comments; we don't have them for SNB/IVB either. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1424225886-18652-1-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0f1b5ca2
  3. 25 March 2015, 1 commit
  4. 24 March 2015, 1 commit
  5. 23 March 2015, 2 commits
  6. 20 March 2015, 1 commit
    • Revert "x86/PCI: Refine the way to release PCI IRQ resources" · 9e8ce4b9
      Committed by Rafael J. Wysocki
      Commit b4b55cda (Refine the way to release PCI IRQ resources)
      introduced a regression in PCI IRQ resource management by causing
      the IRQ resource of a device, established when pci_enable_device()
      is called on a fully disabled device, to be released when the driver
      is unbound from the device, regardless of the enable_cnt.
      
      This leads to the situation that an ill-behaved driver can now make a
      device unusable to subsequent drivers by an imbalance in their use of
      pci_enable/disable_device().  That is a serious problem for secondary
      drivers like vfio-pci, which are innocent of the transgressions of
      the previous driver.
      
      Since the solution of this problem is not immediate and requires
      further discussion, revert commit b4b55cda; the issue it was
      supposed to address (a bug related to xen-pciback) will be taken
      care of in a different way going forward.
      Reported-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      9e8ce4b9
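      [ Editorial illustration — hypothetical driver code showing the
        enable_cnt imbalance described above: ]

        static int bad_probe(struct pci_dev *pdev, const struct pci_device_id *id)
        {
                pci_enable_device(pdev);        /* enable_cnt: 0 -> 1 */
                pci_enable_device(pdev);        /* enable_cnt: 1 -> 2 (unbalanced) */
                return 0;
        }

        static void bad_remove(struct pci_dev *pdev)
        {
                pci_disable_device(pdev);       /* enable_cnt: 2 -> 1, never reaches 0 */
                /* With the reverted commit, the IRQ resource was released
                 * on unbind regardless of enable_cnt, leaving the device
                 * unusable for the next driver (e.g. vfio-pci). */
        }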
  7. 18 March 2015, 1 commit
  8. 17 March 2015, 1 commit
  9. 16 March 2015, 1 commit
    • Revert "x86/mm/ASLR: Propagate base load address calculation" · 69797daf
      Committed by Borislav Petkov
      This reverts commit:
      
        f47233c2 ("x86/mm/ASLR: Propagate base load address calculation")
      
      The main reason for the revert is that the new boot flag does not work
      at all currently, and in order to make this work, we need non-trivial
      changes to the x86 boot code which we didn't manage to get done in
      time for merging.
      
      And even if we did, they would have been too risky, so instead of
      rushing things and breaking booting of 4.1 on boxes left and right,
      we will be very strict and conservative and take our time to fix
      and test this properly.
      Reported-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Junjie Mao <eternal.n08@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt.fleming@intel.com>
      Link: http://lkml.kernel.org/r/20150316100628.GD22995@pd.tnic
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      69797daf
  10. 13 March 2015, 5 commits
    • KVM: VMX: Set msr bitmap correctly if vcpu is in guest mode · 670125bd
      Committed by Wincy Van
      In commit 3af18d9c ("KVM: nVMX: Prepare for using hardware MSR bitmap"),
      we set MSR_BITMAP in prepare_vmcs02 if the hardware bitmap should be
      used. This is not enough, since the field will be modified by the
      subsequent vmx_set_efer.
      
      Fix this by setting vmx_msr_bitmap_nested in vmx_set_msr_bitmap if vcpu is
      in guest mode.
      Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      670125bd
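      [ Editorial sketch of the fix, simplified from the arch/x86/kvm/vmx.c
        of that era (the x2apic/long-mode selection is elided): ]

        static void vmx_set_msr_bitmap(struct kvm_vcpu *vcpu)
        {
                unsigned long *msr_bitmap;

                if (is_guest_mode(vcpu))
                        /* stays in effect even when vmx_set_efer()
                         * re-runs this function later */
                        msr_bitmap = vmx_msr_bitmap_nested;
                else if (is_long_mode(vcpu))
                        msr_bitmap = vmx_msr_bitmap_longmode;
                else
                        msr_bitmap = vmx_msr_bitmap_legacy;

                vmcs_write64(MSR_BITMAP, __pa(msr_bitmap));
        }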
    • x86/fpu: Drop_fpu() should not assume that tsk equals current · f4c36863
      Committed by Oleg Nesterov
      drop_fpu() does clear_used_math() and usually this is correct
      because tsk == current.
      
      However, switch_fpu_finish()->restore_fpu_checking() is called before
      __switch_to() updates the "current_task" variable. If it fails,
      we will wrongly clear the PF_USED_MATH flag of the previous task.
      
      So use clear_stopped_child_used_math() instead.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pekka Riikonen <priikone@iki.fi>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Suresh Siddha <sbsiddha@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150309171041.GB11388@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f4c36863
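      [ Editorial sketch of the fixed helper, simplified from the 3.19-era
        fpu-internal.h: everything now operates on tsk, not on current: ]

        static inline void drop_fpu(struct task_struct *tsk)
        {
                preempt_disable();
                tsk->thread.fpu_counter = 0;
                __drop_fpu(tsk);
                /* was clear_used_math(), which implicitly acted on current */
                clear_stopped_child_used_math(tsk);
                preempt_enable();
        }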
    • x86/fpu: Avoid math_state_restore() without used_math() in __restore_xstate_sig() · a7c80ebc
      Committed by Oleg Nesterov
      math_state_restore() assumes it is called with irqs disabled,
      but this is not true if the caller is __restore_xstate_sig().
      
      This means that if ia32_fxstate == T and __copy_from_user()
      fails, __restore_xstate_sig() returns with irqs disabled too.
      
      This triggers:
      
        BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:41
         dump_stack
         ___might_sleep
         ? _raw_spin_unlock_irqrestore
         __might_sleep
         down_read
         ? _raw_spin_unlock_irqrestore
         print_vma_addr
         signal_fault
         sys32_rt_sigreturn
      
      Change __restore_xstate_sig() to call set_used_math()
      unconditionally. This avoids enabling and disabling interrupts
      in math_state_restore(). If copy_from_user() fails, we can
      simply do fpu_finit() by hand.
      
      [ Note: this is only the first step. math_state_restore() should
        not check used_math(), it should set this flag, while init_fpu()
        should simply die. ]
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: <stable@vger.kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pekka Riikonen <priikone@iki.fi>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suresh Siddha <sbsiddha@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150307153844.GB25954@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a7c80ebc
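      [ Editorial sketch of the fixed tail of __restore_xstate_sig(),
        simplified; names follow the 3.19-era FPU code: ]

        set_used_math();        /* unconditional, no irq games needed */
        if (__copy_from_user(&fpu->state->xsave, buf_fx, state_size)) {
                fpu_finit(fpu); /* copy failed: install a clean init state by hand */
                err = -1;
        }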
    • crypto: aesni - fix memory usage in GCM decryption · ccfe8c3f
      Committed by Stephan Mueller
      The kernel crypto API logic requires the caller to provide the
      length of (ciphertext || authentication tag) as cryptlen for the
      AEAD decryption operation. Thus, the cipher implementation must
      calculate the size of the plaintext output itself and cannot simply use
      cryptlen.
      
      The RFC4106 GCM decryption operation tries to overwrite cryptlen bytes
      in req->dst. As the destination buffer only needs to hold the
      plaintext, while cryptlen refers to the input buffer holding
      (ciphertext || authentication tag), this assumption about the
      destination buffer length makes the write too large. This patch simply
      uses the already calculated plaintext size.
      
      In addition, this patch fixes the offset calculation for the AAD buffer
      pointer: as mentioned before, cryptlen already includes the size of the
      tag, so the tag must not be added again. With that addition, the AAD
      would be written beyond the already allocated buffer.
      
      Note, this fixes a kernel crash that can be triggered from user space
      via AF_ALG(aead) -- simply use the libkcapi test application
      from [1] and update it to use rfc4106-gcm-aes.
      
      Using [1], the changes were tested using CAVS vectors to demonstrate
      that the crypto operation still delivers the right results.
      
      [1] http://www.chronox.de/libkcapi.html
      
      Cc: Tadeusz Struk <tadeusz.struk@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Stephan Mueller <smueller@chronox.de>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      ccfe8c3f
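      [ Editorial sketch of the corrected length handling, simplified;
        variable names follow the aesni-intel_glue.c of that era: ]

        struct crypto_aead *tfm = crypto_aead_reqtfm(req);
        unsigned int auth_tag_len = crypto_aead_authsize(tfm);
        /* cryptlen covers (ciphertext || tag); the plaintext written to
         * req->dst is shorter by the tag length: */
        unsigned long tempCipherLen = (unsigned long)req->cryptlen - auth_tag_len;
        /* likewise, the AAD offset must not add auth_tag_len on top of
         * cryptlen, which already includes the tag */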
    • kvm: x86: i8259: return initialized data on invalid-size read · c1a6bff2
      Committed by Petr Matousek
      If data is read from the PIC with an invalid access size, the returned
      data stays uninitialized even though success is returned.
      
      Fix this by always initializing the data.
      Signed-off-by: Petr Matousek <pmatouse@redhat.com>
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      c1a6bff2
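      [ Editorial sketch of the fix in the PIC read handler, simplified: ]

        if (len != 1) {
                memset(val, 0, len);    /* was: buffer left uninitialized */
                pr_pic_unimpl("non byte read\n");
                return 0;
        }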
  11. 12 March 2015, 2 commits
  12. 11 March 2015, 1 commit
  13. 10 March 2015, 1 commit
  14. 06 March 2015, 1 commit
    • x86/vdso: Fix the build on GCC5 · e8932869
      Committed by Jiri Slaby
      On gcc5 the kernel does not link:
      
        ld: .eh_frame_hdr table[4] FDE at 0000000000000648 overlaps table[5] FDE at 0000000000000670.
      
      This is because prior GCC versions always emitted NOPs for ALIGN
      directives, but gcc5 started omitting them.
      
      .LSTARTFDEDLSI1 says:
      
              /* HACK: The dwarf2 unwind routines will subtract 1 from the
                 return address to get an address in the middle of the
                 presumed call instruction.  Since we didn't get here via
                 a call, we need to include the nop before the real start
                 to make up for it.  */
              .long .LSTART_sigreturn-1-.     /* PC-relative start address */
      
      But commit 69d0627a ("x86 vDSO: reorder vdso32 code") from 2.6.25
      replaced .org __kernel_vsyscall+32,0x90 by ALIGN right before
      __kernel_sigreturn.
      
      Of course, ALIGN need not generate any NOPs there. In particular, gcc5
      collapses vclock_gettime.o and int80.o together with no NOPs generated
      for the ALIGN.
      
      So fix this by emitting at least a single NOP at that point, and then
      ALIGN the function, possibly padding it with more NOPs.
      
      Kudos for reporting and diagnosing should go to Richard.
      Reported-by: Richard Biener <rguenther@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Acked-by: Andy Lutomirski <luto@amacapital.net>
      Cc: <stable@vger.kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1425543211-12542-1-git-send-email-jslaby@suse.cz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e8932869