1. 12 October 2010, 1 commit
    • x86, numa: For each node, register the memory blocks actually used · 73cf624d
      Committed by Yinghai Lu
      Russ reported that SGI UV is broken in recent kernels. He said:
      
      | The SRAT table shows that memory range is spread over two nodes.
      |
      | SRAT: Node 0 PXM 0 100000000-800000000
      | SRAT: Node 1 PXM 1 800000000-1000000000
      | SRAT: Node 0 PXM 0 1000000000-1080000000
      |
      |Previously, the kernel early_node_map[] would show three entries
      |with the proper node.
      |
      |[    0.000000]     0: 0x00100000 -> 0x00800000
      |[    0.000000]     1: 0x00800000 -> 0x01000000
      |[    0.000000]     0: 0x01000000 -> 0x01080000
      |
      |The problem is recent community kernel early_node_map[] shows
      |only two entries with the node 0 entry overlapping the node 1
      |entry.
      |
      |    0: 0x00100000 -> 0x01080000
      |    1: 0x00800000 -> 0x01000000
      
      After looking at the changelog, I found out that it has been broken
      for a while by the following commit:
      
      |commit 8716273c
      |Author: David Rientjes <rientjes@google.com>
      |Date:   Fri Sep 25 15:20:04 2009 -0700
      |
      |    x86: Export srat physical topology
      
      Before that commit, register_active_regions() was called for every SRAT
      memory entry right away.
      
      Use nodememblk_range[] instead of nodes[] in order to make sure we
      capture the actual memory blocks registered with each node.  nodes[]
      contains an extended range which spans all memory regions associated
      with a node, but that does not mean that all the memory in between
      is included.
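
      A minimal sketch of the idea (node_memblk_range[], memblk_nodeid[]
      and e820_register_active_regions() are the srat_64.c interfaces of
      this era; treat the snippet as an illustration, not the verbatim
      patch):

         /*
          * Register each parsed SRAT memory block with its owning node,
          * so a hole between two blocks of the same node is not
          * registered as usable memory.
          */
         for (i = 0; i < num_node_memblks; i++)
                 e820_register_active_regions(memblk_nodeid[i],
                         node_memblk_range[i].start >> PAGE_SHIFT,
                         node_memblk_range[i].end >> PAGE_SHIFT);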
      Reported-by: Russ Anderson <rja@sgi.com>
      Tested-by: Russ Anderson <rja@sgi.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4CB27BDF.5000800@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: <stable@kernel.org> 2.6.33 .34 .35 .36
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
  2. 11 October 2010, 3 commits
    • KVM: x86: Move TSC reset out of vmcb_init · 47008cd8
      Committed by Zachary Amsden
      The VMCB is reset whenever we receive a startup IPI, so Linux setting
      the TSC back to zero happens very late in the boot process, destabilizing
      the TSC.  Instead, just set the TSC to zero once, at VCPU creation time.
      
      Why the separate patch?  So git-bisect is your friend.
      Signed-off-by: Zachary Amsden <zamsden@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: x86: Fix SVM VMCB reset · 58877679
      Committed by Zachary Amsden
      On reset, the VMCB TSC should be set to zero.  Instead, the code was
      setting tsc_offset to zero, which passes the underlying host TSC
      through to the guest.
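
      The guest-visible TSC is host_tsc + tsc_offset, so a zero offset
      merely passes the host counter through.  A sketch of what zeroing
      the guest TSC at reset looks like (illustrative, not the verbatim
      patch):

         /* Make the guest observe TSC == 0 by cancelling the host value. */
         svm->vmcb->control.tsc_offset = 0 - native_read_tsc();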
      Signed-off-by: Zachary Amsden <zamsden@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • x86, AMD, MCE thresholding: Fix the MCi_MISCj iteration order · 6dcbfe4f
      Committed by Borislav Petkov
      This fixes possible cases of not collecting valid error info in
      the MCE error thresholding groups on F10h hardware.
      
      The current code contains a subtle problem: it checks only the
      Valid bit of MSR0000_0413 (which is MC4_MISC0, the DRAM
      thresholding group) in its first iteration and breaks out if
      the bit is cleared.
      
      However, this MSR contains an offset value, BlkPtr[31:24], which
      points to the remaining MSRs in this thresholding group, and those
      might contain valid information too. If we bail out after checking
      only the Valid bit of the first MSR, and not the block pointer as
      well, we miss that other information.
      
      The thing is, MC4_MISC0[BlkPtr] is not predicated on
      MCi_STATUS[MiscV] or MC4_MISC0[Valid] and should be checked
      prior to iterating over the MCI_MISCj thresholding group,
      irrespective of the MC4_MISC0[Valid] setting.
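
      A sketch of the intended iteration order (process_threshold_block()
      and threshold_block_addr() are made-up helpers for illustration;
      the real code lives in arch/x86/kernel/cpu/mcheck/mce_amd.c):

         u32 low, high, blkptr;

         rdmsr(0x413, low, high);         /* MC4_MISC0, DRAM group */
         blkptr = (low >> 24) & 0xff;     /* read BlkPtr[31:24] up front */

         if (high & (1u << 31))           /* MC4_MISC0[Valid] */
                 process_threshold_block(0x413, low, high);

         /* BlkPtr is meaningful even when Valid is clear: walk the rest. */
         for (j = 1; blkptr && j < NR_BLOCKS; j++) {
                 u32 addr = threshold_block_addr(blkptr, j);

                 rdmsr(addr, low, high);
                 if (high & (1u << 31))
                         process_threshold_block(addr, low, high);
         }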
      Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  3. 08 October 2010, 1 commit
  4. 06 October 2010, 1 commit
    • modules: Fix module_bug_list list corruption race · 5336377d
      Committed by Linus Torvalds
      With all the recent module loading cleanups, we've minimized the code
      that sits under module_mutex, fixing various deadlocks and making it
      possible to do most of the module loading in parallel.
      
      However, that whole conversion totally missed the rather obscure code
      that adds a new module to the list for BUG() handling.  That code was
      doubly obscure because (a) the code itself lives in lib/bug.c (for
      dubious reasons) and (b) it gets called from the architecture-specific
      "module_finalize()" rather than from generic code.
      
      Calling it from arch-specific code makes no sense whatsoever to begin
      with, and is now actively wrong since that code isn't protected by the
      module loading lock any more.
      
      So this commit moves the "module_bug_{finalize,cleanup}()" calls away
      from the arch-specific code, and into the generic code - and in the
      process protects it with the module_mutex so that the list operations
      are now safe.
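
      The shape of the fix, as a simplified sketch of the load_module()
      path in kernel/module.c (surrounding code abbreviated):

         mutex_lock(&module_mutex);
         /* ... duplicate-symbol check ... */
         module_bug_finalize(info.hdr, info.sechdrs, mod); /* now locked */
         list_add_rcu(&mod->list, &modules);
         mutex_unlock(&module_mutex);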
      
      Future fixups:
       - move the module list handling code into kernel/module.c where it
         belongs.
       - get rid of 'module_bug_list' and just use the regular list of modules
         (called 'modules' - imagine that) that we already create and maintain
         for other reasons.
      Reported-and-tested-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Adrian Bunk <bunk@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 05 October 2010, 1 commit
  6. 01 October 2010, 4 commits
  7. 30 September 2010, 1 commit
  8. 29 September 2010, 2 commits
  9. 27 September 2010, 1 commit
  10. 25 September 2010, 1 commit
  11. 24 September 2010, 1 commit
    • perf, x86: Catch spurious interrupts after disabling counters · 63e6be6d
      Committed by Robert Richter
      Some cpus still deliver spurious interrupts after a counter has been
      disabled. This caused 'undelivered NMI' messages. This patch fixes
      that. The problem was introduced by:
      
        4177c42a: perf, x86: Try to handle unknown nmis with an enabled PMU
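
      A sketch of the approach (simplified from the x86 perf NMI handler;
      cpuc->running is a per-cpu bitmask set when a counter is started
      and only cleared here):

         for (idx = 0; idx < x86_pmu.num_counters; idx++) {
                 if (!test_bit(idx, cpuc->active_mask)) {
                         /*
                          * The counter was disabled, but an interrupt
                          * may still be in flight; swallow it so the
                          * NMI is not reported as undelivered.
                          */
                         if (__test_and_clear_bit(idx, cpuc->running))
                                 handled++;
                         continue;
                 }
                 /* ... normal overflow handling ... */
         }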
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Robert Richter <robert.richter@amd.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: gorcunov@gmail.com <gorcunov@gmail.com>
      Cc: fweisbec@gmail.com <fweisbec@gmail.com>
      Cc: ying.huang@intel.com <ying.huang@intel.com>
      Cc: ming.m.lin@intel.com <ming.m.lin@intel.com>
      Cc: yinghai@kernel.org <yinghai@kernel.org>
      Cc: andi@firstfloor.org <andi@firstfloor.org>
      Cc: eranian@google.com <eranian@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      LKML-Reference: <20100915162034.GO13563@erda.amd.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  12. 23 September 2010, 5 commits
    • x86/amd-iommu: Fix rounding-bug in __unmap_single · 04e0463e
      Committed by Joerg Roedel
      In the __unmap_single function the dma_addr is rounded down
      to a page boundary before the dma pages are unmapped. The
      address is later also used to flush the TLB entries for that
      mapping. But without the offset into the dma page, the number
      of pages to flush might be miscalculated in the TLB flushing
      path. This patch fixes this bug by using the original
      address to flush the TLB.
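
      The shape of the fix, as an abbreviated sketch of __unmap_single()
      (the flush helper name is quoted from memory):

         dma_addr_t flush_addr = dma_addr; /* keep the offset for the flush */

         dma_addr &= PAGE_MASK;            /* unmap on page granularity */
         /* ... unmap the dma pages ... */
         iommu_flush_pages(&dma_dom->domain, flush_addr, size);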
      
      Cc: stable@kernel.org
      Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
    • x86/amd-iommu: Work around S3 BIOS bug · 4c894f47
      Committed by Joerg Roedel
      This patch adds a workaround for an IOMMU BIOS problem to
      the AMD IOMMU driver. The result of the bug is that the
      IOMMU no longer executes commands when the system comes out
      of the S3 state, resulting in system failure. The bug in the
      BIOS is that it does not restore certain hardware-specific
      registers correctly. This workaround reads out the contents
      of these registers at boot time and restores them on resume
      from S3. The workaround is limited to the specific IOMMU
      chipset where this problem occurs.
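
      In outline (an illustrative sketch only; the accessor names are
      made up, not the driver's real ones):

         /* At boot: snapshot the registers the BIOS fails to restore. */
         for (i = 0; i < N_QUIRK_REGS; i++)
                 saved[i] = iommu_read_quirk_reg(iommu, i);

         /* On resume from S3, only on the affected chipset: */
         if (iommu->quirk_applicable)
                 for (i = 0; i < N_QUIRK_REGS; i++)
                         iommu_write_quirk_reg(iommu, i, saved[i]);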
      
      Cc: stable@kernel.org
      Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
    • x86/amd-iommu: Set iommu configuration flags in enable-loop · e9bf5197
      Committed by Joerg Roedel
      This patch moves the setting of the configuration and
      feature flags out of the ACPI table parsing path and into
      the iommu-enable path. This is needed to reliably fix
      resume-from-s3.
      
      Cc: stable@kernel.org
      Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
    • tracing/x86: Don't use mcount in kvmclock.c · 258af474
      Committed by Steven Rostedt
      The guest can use the paravirt clock in kvmclock.c, which is used
      by sched_clock(), which in turn is used by the tracing mechanism
      for timestamps, which leads to infinite recursion.
      
      Disable mcount/tracing for kvmclock.o.
      
      Cc: stable@kernel.org
      Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Avi Kivity <avi@redhat.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • tracing/x86: Don't use mcount in pvclock.c · 9ecd4e16
      Committed by Jeremy Fitzhardinge
      When using a paravirt clock, pvclock.c can be used by sched_clock(),
      which in turn is used by the tracing mechanism for timestamps,
      which leads to infinite recursion.
      
      Disable mcount/tracing for pvclock.o.
      
      Cc: stable@kernel.org
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      LKML-Reference: <4C9A9A3F.4040201@goop.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  13. 22 September 2010, 2 commits
  14. 21 September 2010, 2 commits
  15. 17 September 2010, 1 commit
    • x86: Fix instruction breakpoint encoding · 89e45aac
      Committed by Frederic Weisbecker
      Lengths and types of breakpoints are each encoded in a half byte
      in the CPU registers. However, when we extract these values
      and store them, we add a high half-byte part to them: 0x40 to the
      length and 0x80 to the type.
      When that gets reloaded into the CPU registers, the high part
      is masked off.
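
      Concretely (constants as in arch/x86/include/asm/hw_breakpoint.h
      of this era, quoted from memory, so treat them as illustrative):

         #define X86_BREAKPOINT_LEN_1    0x40    /* dr7 len 0x0 | 0x40 */
         #define X86_BREAKPOINT_LEN_2    0x44    /* dr7 len 0x4 | 0x40 */
         #define X86_BREAKPOINT_LEN_4    0x4c    /* dr7 len 0xc | 0x40 */

         #define X86_BREAKPOINT_EXECUTE  0x80    /* dr7 type 0x0 | 0x80 */
         #define X86_BREAKPOINT_WRITE    0x81    /* dr7 type 0x1 | 0x80 */
         #define X86_BREAKPOINT_RW       0x83    /* dr7 type 0x3 | 0x80 */

         /* Only the low half byte reaches dr7; the high part is masked: */
         dr7 |= ((len | type) & 0xf)
                 << (DR_CONTROL_SHIFT + drnum * DR_CONTROL_SIZE);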
      
      While making the instruction breakpoints available for perf,
      I zapped that high part from the instruction breakpoint encoding,
      and that broke the arch -> generic translation used by ptrace
      instruction breakpoints. Writing dr7 to set an instruction
      breakpoint was then failing.
      
      There is no apparent reason for these high parts, so we could get
      rid of them altogether. That's an invasive change though, so let's
      do that later; for now, fix the problem by restoring the instruction
      breakpoint high-part encoding in this sole patch.
      Reported-by: Kelvie Wong <kelvie@ieee.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Prasad <prasad@linux.vnet.ibm.com>
      Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Cc: Will Deacon <will.deacon@arm.com>
  16. 16 September 2010, 1 commit
  17. 15 September 2010, 4 commits
    • x86-64, compat: Retruncate rax after ia32 syscall entry tracing · eefdca04
      Committed by Roland McGrath
      In commit d4d67150, we reopened an old hole for a 64-bit ptracer touching a
      32-bit tracee in system call entry.  A %rax value set via ptrace at the
      entry tracing stop gets used whole as a 32-bit syscall number, while we
      only check the low 32 bits for validity.
      
      Fix it by truncating %rax back to 32 bits after syscall_trace_enter,
      in addition to testing the full 64 bits as has already been added.
      Reported-by: Ben Hawkes <hawkes@sota.gen.nz>
      Signed-off-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • x86-64, compat: Test %rax for the syscall number, not %eax · 36d001c7
      Committed by H. Peter Anvin
      On 64 bits, we always, by necessity, jump through the system call
      table via %rax.  For 32-bit system calls, in theory the system call
      number is stored in %eax, and the code was testing %eax for a valid
      system call number.  At one point we loaded the stored value back from
      the stack to enforce zero-extension, but that was removed in checkin
      d4d67150.  An actual 32-bit process
      will not be able to introduce a non-zero-extended number, but it can
      happen via ptrace.
      
      Instead of re-introducing the zero-extension, test what we are
      actually going to use, i.e. %rax.  This only adds a handful of REX
      prefixes to the code.
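
      A userspace demonstration of why testing only %eax is insufficient
      (standalone C, not kernel code; the table size is illustrative):

         #include <stdint.h>
         #include <stdio.h>

         #define IA32_NR_syscalls 341    /* illustrative table size */

         int main(void)
         {
                 /* A 64-bit ptracer can leave rax non-zero-extended. */
                 uint64_t rax = 0x100000000ULL | 26;

                 /* Buggy check: the low 32 bits look valid... */
                 printf("eax check: %s\n",
                        (uint32_t)rax < IA32_NR_syscalls ? "pass" : "fail");

                 /* ...but the full value would index far out of bounds. */
                 printf("rax check: %s\n",
                        rax < IA32_NR_syscalls ? "pass" : "fail");
                 return 0;
         }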
      Reported-by: Ben Hawkes <hawkes@sota.gen.nz>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Cc: <stable@kernel.org>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
    • compat: Make compat_alloc_user_space() incorporate the access_ok() · c41d68a5
      Committed by H. Peter Anvin
      compat_alloc_user_space() expects the caller to independently call
      access_ok() to verify the returned area.  A missing call could
      introduce problems on some architectures.
      
      This patch incorporates the access_ok() check into
      compat_alloc_user_space() and also adds a sanity check on the length.
      The existing compat_alloc_user_space() implementations are renamed
      arch_compat_alloc_user_space() and are used as part of the
      implementation of the new global function.
      
      This patch assumes NULL will cause __get_user()/__put_user() to either
      fail or access userspace on all architectures.  This should be
      followed by checking the return value of compat_alloc_user_space()
      for NULL in the callers, at which point the access_ok() in the callers
      can also be removed.
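
      The shape of the new wrapper (close to the kernel/compat.c code of
      this era, but quoted from memory; treat it as a sketch):

         void __user *compat_alloc_user_space(unsigned long len)
         {
                 void __user *ptr;

                 /* Sanity check: reject absurd allocation lengths. */
                 if (unlikely(len > (((compat_uptr_t)~0) >> 1)))
                         return NULL;

                 ptr = arch_compat_alloc_user_space(len);

                 if (unlikely(!access_ok(VERIFY_WRITE, ptr, len)))
                         return NULL;

                 return ptr;
         }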
      Reported-by: Ben Hawkes <hawkes@sota.gen.nz>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: Chris Metcalf <cmetcalf@tilera.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Tony Luck <tony.luck@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: James Bottomley <jejb@parisc-linux.org>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: <stable@kernel.org>
    • x86: hpet: Work around hardware stupidity · 54ff7e59
      Committed by Thomas Gleixner
      This more or less reverts commits 08be9796 (x86: Force HPET
      readback_cmp for all ATI chipsets) and 30a564be (x86, hpet: Restrict
      read back to affected ATI chipsets) to the status of commit 8da854cb
      (x86, hpet: Erratum workaround for read after write of HPET
      comparator).
      
      The delta to commit 8da854cb is mostly comments and the change from
      WARN_ONCE to printk_once as we know the call path of this function
      already.
      
      This really needs an in-depth explanation:
      
      First of all the HPET design is a complete failure. Having a counter
      compare register which generates an interrupt on matching values
      forces the software to do at least one superfluous readback of the
      counter register.
      
      While it is nice in theory to program "absolute" time events it is
      practically useless because the timer runs at some absurd frequency
      which can never be matched to real world units. So we are forced to
      calculate a relative delta and this forces a readout of the actual
      counter value, adding the delta and programming the compare
      register. When the delta is small enough we run into the danger that
      we program a compare value which is already in the past. Due to the
      compare for equal nature of HPET we need to read back the counter
      value after writing the compare register (btw. this is necessary for
      absolute timeouts as well) to make sure that we did not miss the timer
      event. We try to work around that by setting the minimum delta to a
      value which is larger than the theoretical time which elapses between
      the counter readout and the compare register write, but that's only
      true in theory. A NMI or SMI which hits between the readout and the
      write can easily push us beyond that limit. This would result in
      waiting for the next HPET timer interrupt until the 32bit wraparound
      of the counter happens which takes about 306 seconds.
      
      So we designed the next event function to look like:
      
         match = read_cnt() + delta;
         write_compare_ref(match);
         return read_cnt() < match ? 0 : -ETIME;
      
      At some point we got into trouble with certain ATI chipsets. Even the
      above "safe" procedure failed. The reason was that the write to the
      compare register was delayed probably for performance reasons. The
      theory was that they wanted to avoid the synchronization of the write
      with the HPET clock, which is understandable. So the write does not
      hit the compare register directly instead it goes to some intermediate
      register which is copied to the real compare register in sync with the
      HPET clock. That opens another window for hitting the dreaded "wait
      for a wraparound" problem.
      
      To work around that "optimization" we added a read back of the compare
      register which either enforced the update of the just written value or
      just delayed the readout of the counter enough to avoid the issue. We
      unfortunately never got any affirmative info from ATI/AMD about this.
      
      One thing is sure: we nuked the performance "optimization" that way
      completely, and I'm pretty sure that the result is worse than before
      some HW folks came up with it.
      
      Just for paranoia reasons I added a check whether the read back
      compare register value was the same as the value we wrote right
      before. That paranoia check triggered a couple of years after it was
      added on an Intel ICH9 chipset. Venki added a workaround (commit
      8da854cb) which was reading the compare register twice when the first
      check failed. We considered this to be a penalty in general and
      restricted the readback (thus the wasted CPU cycles) to the ATI
      chipsets known to be affected.
      
      This turned out to be an utterly wrong decision. 2.6.35 testers
      experienced massive problems and finally one of them bisected it down
      to commit 30a564be, which spurred some further investigation.
      
      Finally we got confirmation that the write to the compare register can
      be delayed by up to two HPET clock cycles which explains the problems
      nicely. All we can do about this is to go back to Venki's initial
      workaround in a slightly modified version.
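
      The resulting code looks roughly like this (a simplified sketch of
      hpet_next_event() in arch/x86/kernel/hpet.c; the real function
      carries a much longer comment):

         static int hpet_next_event(unsigned long delta,
                                    struct clock_event_device *evt,
                                    int timer)
         {
                 u32 cnt, cmp;

                 cnt = hpet_readl(HPET_COUNTER);
                 cnt += (u32) delta;
                 hpet_writel(cnt, HPET_Tn_CMP(timer));

                 /*
                  * Read the comparator back: on the affected chipsets
                  * this forces (or waits out) the delayed update of
                  * the real compare register.
                  */
                 cmp = hpet_readl(HPET_Tn_CMP(timer));
                 if (cmp != cnt)
                         printk_once(KERN_WARNING
                             "hpet: compare register read back failed.\n");

                 /* Did the match value slip into the past meanwhile? */
                 return (s32)(hpet_readl(HPET_COUNTER) - cnt) >= 0
                         ? -ETIME : 0;
         }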
      
      Just for the record I need to say that all of this could have been
      avoided if hardware designers, and of course the HPET committee, had
      thought about the consequences for a split second. It's beyond my
      comprehension why designing a working timer is so hard. There are two
      ways to achieve it:
      
       1) Use a counter wrap around aware compare_reg <= counter_reg
          implementation instead of the easy compare_reg == counter_reg
      
          Downsides:
      
      	- It needs more silicon.
      
      	- It needs a readout of the counter to apply a relative
      	  timeout. This is necessary as the counter does not run in
      	  any useful (and adjustable) frequency and there is no
      	  guarantee that the counter which is used for timer events is
      	  the same which is used for reading the actual time (and
      	  therefore for calculating the delta)
      
          Upsides:
      
      	- None
      
        2) Use a simple down counter for relative timer events
      
          Downsides:
      
      	- Absolute timeouts are not possible, which is not a problem
      	  at all in the context of an OS and the expected
      	  max. latencies/jitter (also see Downsides of #1)
      
          Upsides:
      
      	- It needs less or equal silicon.
      
      	- It works ALWAYS
      
      	- It is way faster than a compare register based solution (One
      	  write versus one write plus at least one and up to four
      	  reads)
      
      I would not be so grumpy about all of this if I had not been ignored
      for many years when pointing out these flaws to various hardware
      folks. I really hate timers (at least those which seem to be designed
      by janitors).
      
      Though finally we got a reasonable explanation plus a solution and I
      want to thank all the folks involved in chasing it down and providing
      valuable input to this.
      Bisected-by: Nix <nix@esperi.org.uk>
      Reported-by: Artur Skawina <art.08.09@gmail.com>
      Reported-by: Damien Wyart <damien.wyart@free.fr>
      Reported-by: John Drescher <drescherjm@gmail.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: stable@kernel.org
      Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  18. 14 September 2010, 2 commits
  19. 11 September 2010, 2 commits
  20. 10 September 2010, 1 commit
  21. 09 September 2010, 3 commits