1. 22 Jul, 2007 2 commits
    • x86_64: mcelog tolerant level cleanup · bd78432c
      Tim Hockin committed
      Background:
       The MCE handler has several paths that it can take, depending on various
       conditions of the MCE status and the value of the 'tolerant' knob.  The
       exact semantics are not well defined and the code is a bit twisty.
      
      Description:
       This patch makes the MCE handler's behavior more clear by documenting the
       behavior for various 'tolerant' levels.  It also fixes or enhances
       several small things in the handler.  Specifically:
           * If RIPV is not set it is not safe to restart, so set the 'no way out'
             flag rather than the 'kill it' flag.
           * Don't panic() on correctable MCEs.
           * If the _OVER bit is set *and* the _UC bit is set (meaning possibly
             dropped uncorrected errors), set the 'no way out' flag.
           * Use EIPV for testing whether an app can be killed (SIGBUS) rather
             than RIPV.  According to docs, EIPV indicates that the error is
             related to the IP, while RIPV simply means the IP is valid to
             restart from.
           * Don't clear the MCi_STATUS registers until after the panic() path.
             This leaves the status bits set after the panic() so clever BIOSes
             can find them (and dumb BIOSes can do nothing).
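
The flag logic in the bullets above can be sketched roughly as follows. This is an illustration only, not the kernel code: the function name `mce_flags`, the boolean inputs, and the dictionary result are hypothetical, and the sketch ignores the 'tolerant' knob and the PCC bit. It reads the first bullet as "RIPV not set", in line with the later statement that RIPV means the IP is valid to restart from, and it assumes any uncorrected error sets the 'kill it' flag.

```python
def mce_flags(ripv, eipv, banks):
    """Return the handler flags implied by the bullets above.

    ripv, eipv: booleans for the RIPV and EIPV bits of the global status.
    banks: iterable of (uc, over) boolean pairs, one per bank with an error.
    """
    no_way_out = not ripv        # restart IP not valid: cannot resume
    kill_it = False
    for uc, over in banks:
        if not uc:
            continue             # correctable error: log only, never panic
        kill_it = True           # assumption: any uncorrected error needs action
        if over:
            no_way_out = True    # _OVER and _UC: errors may have been dropped
    # SIGBUS is only a candidate when the error is related to the saved IP.
    can_sigbus = kill_it and eipv and not no_way_out
    return {"no_way_out": no_way_out, "kill_it": kill_it, "can_sigbus": can_sigbus}
```

For instance, a single uncorrected error without overflow, with both RIPV and EIPV set, yields `kill_it` and `can_sigbus` but not `no_way_out`.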
      
       This patch also calls nonseekable_open() in mce_open (as suggested by akpm).
      
      Result:
       Tolerant levels behave almost identically to how they always have, but
       now it's well defined.  There's a slightly higher chance of panic()ing
       when multiple errors happen (a good thing, IMHO).  If you take an MBE and
       panic(), the error status bits are not cleared.
      
      Alternatives:
       None.
      
      Testing:
       I used software to inject correctable and uncorrectable errors.  With
       tolerant = 3, the system usually survives.  With tolerant = 2, the system
       usually panic()s (PCC) but not always.  With tolerant = 1, the system
       always panic()s.  When the system panic()s, the BIOS is able to detect
       that the cause of death was an MC4.  I was not able to reproduce the
       case of a non-PCC error in userspace, with EIPV, with (tolerant < 3).
       That will be rare at best.
      Signed-off-by: Tim Hockin <thockin@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86_64: remove unused variable maxcpus · d567b6a9
      Jan Beulich committed
      .. and adjust documentation to properly reflect options that are
      x86-64 specific.
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 03 May, 2007 5 commits
    • [PATCH] x86-64: Dynamically adjust machine check interval · 8a336b0a
      Tim Hockin committed
      Background:
       We've found that MCEs (specifically DRAM SBEs) tend to come in bunches,
       especially when we are trying really hard to stress the system out.  The
       current MCE poller uses a static interval which does not care whether it
       has or has not found MCEs recently.
      
      Description:
       This patch makes the MCE poller adjust the polling interval dynamically.
       If we find an MCE, poll 2x faster (down to 10 ms).  When we stop finding
       MCEs, poll 2x slower (up to check_interval seconds).  The check_interval
       tunable becomes the max polling interval.  The "Machine check events
       logged" printk() is rate limited to the check_interval, which should be
       identical behavior to the old functionality.
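
A minimal sketch of the interval adjustment described above, assuming the halving and doubling happen once per poll. The names `next_interval_ms` and `MIN_INTERVAL_MS` are hypothetical and the real poller works in kernel time units; only the 10 ms floor and the check_interval ceiling come from the description.

```python
MIN_INTERVAL_MS = 10  # floor stated in the description


def next_interval_ms(current_ms, found_mce, check_interval_s):
    """Return the next polling interval in milliseconds.

    current_ms: interval used for the poll that just finished.
    found_mce: True if that poll found at least one MCE.
    check_interval_s: the check_interval tunable, now the maximum interval.
    """
    max_ms = check_interval_s * 1000
    if found_mce:
        # Errors tend to come in bunches, so poll twice as fast.
        return max(MIN_INTERVAL_MS, current_ms // 2)
    # Quiet again: back off toward the configured maximum.
    return min(max_ms, current_ms * 2)
```

With the default check_interval of 5 minutes, a poll that finds an MCE at the maximum interval drops the next interval from 300000 ms to 150000 ms.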
      
      Result:
       If you start to take a lot of correctable errors (not exceptions), you
       log them faster and more accurately (less chance of overflowing the MCA
       registers).  If you don't take a lot of errors, you will see no change.
      
      Alternatives:
       I considered simply reducing the polling interval to 10 ms immediately
       and keeping it there as long as we continue to find errors.  This felt a
       bit heavy handed, but does perform significantly better for the default
       check_interval of 5 minutes (we're using a few seconds when testing for
       DRAM errors).  I could be convinced to go with this, if anyone felt it
       was not too aggressive.
      
      Testing:
       I used an error-injecting DIMM to create lots of correctable DRAM errors
       and verified that the polling interval accelerates.  The printk() only
       happens once per check_interval seconds.
      
      Patch:
       This patch is against 2.6.21-rc7.
      Signed-Off-By: Tim Hockin <thockin@google.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
    • [PATCH] x86-64: fake numa for cpusets document · 20280195
      David Rientjes committed
      Create a document to explain how to use numa=fake in conjunction with cpusets
      for coarse memory resource management.
      
      An attempt to get more awareness and testing for this feature.
      
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] x86-64: fixed size remaining fake nodes · 382591d5
      David Rientjes committed
      Extends the numa=fake x86_64 command-line option to split the remaining system
      memory into nodes of fixed size.  Any leftover memory is allocated to a final
      node unless the command-line ends with a comma.
      
      For example:
        numa=fake=2*512,*128	gives two 512M nodes and the remaining system
      			memory is split into nodes of 128M each.
      
      This is beneficial for systems where the exact size of RAM is unknown or not
      necessarily relevant, but the size of the remaining nodes to be allocated is
      known based on their capacity for resource management.
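
The `*128` form can be pictured with a small sketch. It models only the arithmetic on megabyte counts, not the kernel's page accounting; the function name `split_fixed` and the `keep_leftover` flag (standing in for "the command line does not end with a comma") are hypothetical.

```python
def split_fixed(remaining_mb, size_mb, keep_leftover=True):
    """Split remaining memory into nodes of size_mb megabytes each.

    keep_leftover models the trailing-comma rule: when True, memory that
    does not fill a whole node goes to one final, smaller node.
    """
    full_nodes, leftover = divmod(remaining_mb, size_mb)
    nodes = [size_mb] * full_nodes
    if leftover and keep_leftover:
        nodes.append(leftover)
    return nodes
```

For example, 1000M of remaining memory split as `*128` gives seven 128M nodes plus a final 104M node, or just the seven when the leftover is dropped.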
      
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] x86-64: split remaining fake nodes equally · 14694d73
      David Rientjes committed
      Extends the numa=fake x86_64 command-line option to split the remaining
      system memory into equal-sized nodes.
      
      For example:
      numa=fake=2*512,4*	gives two 512M nodes and the remaining system
      			memory is split into four approximately equal
      			chunks.
      
      This is beneficial for systems where the exact size of RAM is unknown or not
      necessarily relevant, but the granularity with which nodes shall be allocated
      is known.
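
As a rough illustration of the `4*` form, the sketch below divides a remaining amount of memory into approximately equal chunks. The function name `split_equal` is hypothetical, and handing the remainder to the first few nodes is an assumption of this sketch, not something the commit message specifies.

```python
def split_equal(remaining_mb, count):
    """Split remaining_mb megabytes into count approximately equal nodes."""
    if count <= 0:
        raise ValueError("count must be positive")
    base, extra = divmod(remaining_mb, count)
    # The first `extra` nodes each take one additional megabyte (assumption).
    return [base + 1 if i < extra else base for i in range(count)]
```

With 1000M remaining, `4*` would yield four 250M nodes in this model. When the amount does not divide evenly, the node sizes differ by at most 1M.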
      
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] x86-64: configurable fake numa node sizes · 8b8ca80e
      David Rientjes committed
      Extends the numa=fake x86_64 command-line option to allow for configurable
      node sizes.  These nodes can be used in conjunction with cpusets for coarse
      memory resource management.
      
      The old command-line option is still supported:
        numa=fake=32	gives 32 fake NUMA nodes, ignoring the NUMA setup of the
      		actual machine.
      
      But now you may configure your system for the node sizes of your choice:
        numa=fake=2*512,1024,2*256
      		gives two 512M nodes, one 1024M node, two 256M nodes, and
      		the rest of system memory to a sixth node.
      
      The existing hash function is maintained to support the various node sizes
      that are possible with this implementation.
      
      Each node of the same size receives roughly the same amount of available
      pages, regardless of any reserved memory within its address range.  The total
      available pages on the system is calculated and divided by the number of equal
      nodes to allocate.  These nodes are then dynamically allocated and their
      borders extended until such time as their number of available pages reaches
      the required size.
      
      Configurable node sizes are recommended when used in conjunction with cpusets
      for memory control because it eliminates the overhead associated with scanning
      the zonelists of many smaller full nodes on page_alloc().
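
To make the new syntax concrete, here is a sketch of how a string such as `2*512,1024,2*256` expands into node sizes. It is a simplified model, not the kernel parser: `parse_fake_nodes` is a hypothetical name, sizes are plain megabytes rather than available pages, and the legacy single-number form (`numa=fake=32`) is not modeled.

```python
def parse_fake_nodes(spec, total_mb):
    """Expand a numa=fake size list into per-node sizes in megabytes.

    spec: comma-separated terms, each either SIZE or COUNT*SIZE.
    total_mb: total system memory; whatever is left over forms a final node.
    """
    sizes = []
    for term in spec.split(","):
        if "*" in term:
            count, size = term.split("*")
            sizes.extend([int(size)] * int(count))
        else:
            sizes.append(int(term))
    used = sum(sizes)
    if used > total_mb:
        raise ValueError("requested nodes exceed system memory")
    if used < total_mb:
        sizes.append(total_mb - used)  # rest of system memory
    return sizes
```

On a hypothetical 4096M machine, `parse_fake_nodes("2*512,1024,2*256", 4096)` returns five configured nodes and a sixth node of 1536M holding the rest.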
      
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  3. 24 Apr, 2007 1 commit
    • [PATCH] x86: Remove noreplacement option · 9ce883be
      Andi Kleen committed
      noreplacement is dangerous on modern systems because it will not replace the
      context switch FNSAVE with the SSE-aware FXSAVE.  Other places in the kernel
      still assume SSE and do FXSAVE, so the CPU will then access FXSAVE information
      with FNSAVE and cause corruption.
      
      Easiest way to avoid this is to remove the option. It was mostly for paranoia
      reasons anyways and alternative()s have been stable for some time.
      
      Thanks to Jeremy F. for reporting and helping debug it.
      Signed-off-by: Andi Kleen <ak@suse.de>
  4. 13 Feb, 2007 3 commits
  5. 09 Jan, 2007 1 commit
  6. 07 Dec, 2006 2 commits
  7. 04 Oct, 2006 1 commit
  8. 30 Sep, 2006 2 commits
  9. 26 Sep, 2006 2 commits
  10. 29 Jul, 2006 1 commit
  11. 27 Jun, 2006 1 commit
    • [PATCH] x86_64: Calgary IOMMU - Calgary specific bits · e465058d
      Jon Mason committed
      This patch hooks Calgary into the build, the x86-64 IOMMU
      initialization paths, and introduces the Calgary specific bits.  The
      implementation draws inspiration from both PPC (which has support for
      the same chip but requires firmware support which we don't have on
      x86-64) and gart. Calgary is different from gart in that it supports a
      translation table per PHB, as opposed to the single gart aperture.
      
      Changes from previous version:
       * Addition of boot-time disablement for bus-level translation/isolation
         (e.g., enable userspace DMA for things like X)
       * Usage of newer IOMMU abstraction functions
      Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com>
      Signed-off-by: Jon Mason <jdmason@us.ibm.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  12. 10 Apr, 2006 1 commit
    • [PATCH] x86_64: Reserve SRAT hotadd memory on x86-64 · 68a3a7fe
      Andi Kleen committed
      From: Keith Mannthey, Andi Kleen
      
      Implement memory hotadd without sparsemem. The memory in the SRAT
      hotadd area is just preserved instead and can be activated later.
      
      There are a few restrictions:
      - Only one continuous hotadd area allowed per node
      
      The main problem is dealing with the many buggy SRAT tables
      that are out there. The strategy here is to reject anything
      suspicious.
      
      Originally from Keith Mannthey, with several hacks and changes by AK
      and also contributions from Andrew Morton
      
      [ TBD: Problems pointed out by KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>:
      
       1) Goto's rebuild_zonelist patch will not work if CONFIG_MEMORY_HOTPLUG=n.
      
          Rebuilding the zonelist is necessary when the system has just memory <
          4G at boot and hot-adds memory > 4G.  Because x86_64 has DMA32,
          ZONE_NORMAL is not included in the zonelist at boot time if the system
          doesn't have memory > 4G at boot.
      
          [AK: should just force the higher zones at boot time when SRAT tells us]
      
       2) zone and node's spanned_pages and present_pages are not incremented.
          They should be.
      
          For example, our server (ia64/Fujitsu PrimeQuest) can equip memory
          from 4G to 1T (maybe 2T in future), and SRAT will *always* say we have
          possible 1T memory.  (Microsoft requires "write all possible memory
          in SRAT".)  When we reserve memmap for possible 1T memory, Linux will
          not work well in the minimum 4G configuration ;)
      
          [AK: needs limiting to 5-10% of max memory]
       ]
      Signed-off-by: Andi Kleen <ak@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  13. 27 Feb, 2006 1 commit
    • [PATCH] x86_64: Better ATI timer fix · ab9b32ee
      Andi Kleen committed
      The previous experiment for using apicmaintimer on ATI systems didn't
      work out very well.  In particular laptops with C2/C3 support often
      don't let it tick during idle, which makes it useless.  There were also
      some other bugs that made the apicmaintimer often not used at all.
      
      I tried some other experiments - running timer over RTC and some other
      things but they didn't really work well either.
      
      I rechecked the specs now and it turns out this simple change is
      actually enough to avoid the double ticks on the ATI systems.  We just
      turn off IRQ 0 in the 8254 and only route it directly using the IO-APIC.
      
      I tested it on a few ATI systems and it worked there.  In fact it worked
      on all chipsets (NVidia, Intel, AMD, ATI) I tried it on.
      
      According to the ACPI spec routing should always work through the
      IO-APIC so I think it's the correct thing to do anyways (and most of the
      old gunk in check_timer should be thrown away for x86-64).
      
      But for 2.6.16 it's best to do a fairly minimal change:
       - Use the known to be working everywhere-but-ATI IRQ0 both over 8254
         and IO-APIC setup everywhere
       - Except on ATI disable IRQ0 in the 8254
       - Remove the code to select apicmaintimer on ATI chipsets
       - Add some boot options to allow to override this (just paranoia)
      
      In 2.6.17 I hope to switch the default over to this for everybody.
      Signed-off-by: Andi Kleen <ak@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  14. 05 Feb, 2006 2 commits
    • [PATCH] x86_64: Calibrate APIC timer using PM timer · 0c3749c4
      Andi Kleen committed
      On some broken motherboards (at least one NForce3 based AMD64 laptop)
      the PIT timer runs at an incorrect frequency.  This patch adds a new
      option "apicpmtimer" that allows to use the APIC timer and calibrate it
      using the PMTimer.  It requires the earlier patch that allows to run the
      main timer from the APIC.
      
      Specifying apicpmtimer implies apicmaintimer.
      
      The option defaults to off for now.
      
      I tested it on a few systems and the resulting APIC timer frequencies
      were usually a bit off, but always <1%, which should be tolerable.
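
The calibration idea can be sketched as simple arithmetic: count APIC timer ticks over an interval measured by the PM timer, then scale by the PM timer's known frequency. This is a sketch under stated assumptions, not the patch's code. The function name `apic_timer_hz` and its arguments are hypothetical, the inputs are raw counter samples obtained elsewhere, the PM timer is assumed to be a 24-bit up-counter running at 3.579545 MHz, and the APIC counter is assumed to count down.

```python
PM_TIMER_HZ = 3_579_545   # ACPI PM timer frequency
PM_TIMER_MASK = 0xFFFFFF  # assumption: 24-bit PM timer counter


def apic_timer_hz(apic_start, apic_end, pm_start, pm_end):
    """Estimate the APIC timer frequency from two paired counter samples."""
    # The PM timer counts up and may wrap once during the measurement.
    pm_ticks = (pm_end - pm_start) & PM_TIMER_MASK
    if pm_ticks == 0:
        raise ValueError("PM timer did not advance")
    # The APIC current-count register counts down.
    apic_ticks = apic_start - apic_end
    return apic_ticks * PM_TIMER_HZ // pm_ticks
```

Any error in the two samples feeds directly into the result, which may be one source of the small deviations reported above.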
      
      TBD: figure out a heuristic to enable this automatically on the affected
      systems.
      TBD: perhaps do it on all NForce3s or using DMI?
      Signed-off-by: Andi Kleen <ak@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] x86_64: Allow to run main time keeping from the local APIC interrupt · 73dea47f
      Andi Kleen committed
      Another piece from the no-idle-tick patch.
      
      This can be enabled with the "apicmaintimer" option.
      
      This is mainly useful when the PIT/HPET interrupt is unreliable.
      Note there are some systems that are known to stop the APIC
      timer in C3. For those it will never work, but this case
      should be automatically detected.
      
      It also only works with PM timer right now. When HPET is used
      the way the main timer handler computes the delay doesn't work.
      
      It should be a bit more efficient because there is one less
      regular interrupt to process on the boot processor.
      
      Requires earlier bugfix from Venkatesh
      Signed-off-by: Andi Kleen <ak@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  15. 15 Jan, 2006 1 commit
  16. 12 Jan, 2006 2 commits
  17. 15 Nov, 2005 4 commits
  18. 13 Sep, 2005 1 commit
  19. 08 Aug, 2005 1 commit
  20. 29 Jul, 2005 1 commit
  21. 21 May, 2005 1 commit
  22. 17 Apr, 2005 1 commit
    • Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds committed
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!