1. 18 3月, 2009 1 次提交
  2. 17 3月, 2009 2 次提交
    • L
      Fast TSC calibration: calculate proper frequency error bounds · 9e8912e0
      Linus Torvalds 提交于
      In order for ntpd to correctly synchronize the clocks, the frequency of
      the system clock must not be off by more than 500 ppm (or, put another
      way, 1:2000), or ntpd will end up giving up on trying to synchronize
      properly, and ends up reseting the clock in jumps instead.
      
      The fast TSC PIT calibration sometimes failed this test - it was
      assuming that the PIT reads always took about one microsecond each (2us
      for the two reads to get a 16-bit timer), and that calibrating TSC to
      the PIT over 15ms should thus be sufficient to get much closer than
      500ppm (max 2us error on both sides giving 4us over 15ms: a 270 ppm
      error value).
      
      However, that assumption does not always hold: apparently some hardware
      is either very much slower at reading the PIT registers, or there was
      other noise causing at least one machine to get 700+ ppm errors.
      
      So instead of using a fixed 15ms timing loop, this changes the fast PIT
      calibration to read the TSC delta over the individual PIT timer reads,
      and use the result to calculate the error bars on the PIT read timing
      properly.  We then successfully calibrate the TSC only if the maximum
      error bars fall below 500ppm.
      
      In the process, we also relax the timing to allow up to 25ms for the
      calibration, although it can happen much faster depending on hardware.
      Reported-and-tested-by: NJesper Krogh <jesper@krogh.cc>
      Cc: john stultz <johnstul@us.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9e8912e0
    • L
      Fix potential fast PIT TSC calibration startup glitch · a6a80e1d
      Linus Torvalds 提交于
      During bootup, when we reprogram the PIT (programmable interval timer)
      to start counting down from 0xffff in order to use it for the fast TSC
      calibration, we should also make sure to delay a bit afterwards to allow
      the PIT hardware to actually start counting with the new value.
      
      That will happens at the next CLK pulse (1.193182 MHz), so the easiest
      way to do that is to just wait at least one microsecond after
      programming the new PIT counter value.  We do that by just reading the
      counter value back once - which will take about 2us on PC hardware.
      Reported-and-tested-by: Njohn stultz <johnstul@us.ibm.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a6a80e1d
  3. 12 3月, 2009 1 次提交
  4. 10 3月, 2009 1 次提交
    • D
      Revert "[CPUFREQ] Disable sysfs ui for p4-clockmod." · 129f8ae9
      Dave Jones 提交于
      This reverts commit e088e4c9.
      
      Removing the sysfs interface for p4-clockmod was flagged as a
      regression in bug 12826.
      
      Course of action:
       - Find out the remaining causes of overheating, and fix them
         if possible. ACPI should be doing the right thing automatically.
         If it isn't, we need to fix that.
       - mark p4-clockmod ui as deprecated
       - try again with the removal in six months.
      
      It's not really feasible to printk about the deprecation, because
      it needs to happen at all the sysfs entry points, which means adding
      a lot of strcmp("p4-clockmod".. calls to the core, which.. bleuch.
      Signed-off-by: NDave Jones <davej@redhat.com>
      129f8ae9
  5. 09 3月, 2009 3 次提交
    • R
      lguest: fix for CONFIG_SPARSE_IRQ=y · 6db6a5f3
      Rusty Russell 提交于
      Impact: remove lots of lguest boot WARN_ON() when CONFIG_SPARSE_IRQ=y
      
      We now need to call irq_to_desc_alloc_cpu() before
      set_irq_chip_and_handler_name(), but we can't do that from init_IRQ (no
      kmalloc available).
      
      So do it as we use interrupts instead.  Also means we only alloc for
      irqs we use, which was the intent of CONFIG_SPARSE_IRQ anyway.
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      Cc: Ingo Molnar <mingo@redhat.com>
      6db6a5f3
    • R
      lguest: fix crash 'unhandled trap 13 at <native_read_msr_safe>' · cbd88c8e
      Rusty Russell 提交于
      Impact: fix lguest boot crash on modern Intel machines
      
      The code in early_init_intel does:
      
      	if (c->x86 > 6 || (c->x86 == 6 && c->x86_model >= 0xd)) {
      		u64 misc_enable;
      
      		rdmsrl(MSR_IA32_MISC_ENABLE, misc_enable);
      
      And that rdmsr faults (not allowed from non-0 PL).  We can get around
      this by mugging the family ID part of the cpuid.  5 seems like a good
      number.
      
      Of course, this is a hack (how very lguest!).  We could just indicate
      that we don't support MSRs, or implement lguest_rdmst.
      Reported-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      Tested-by: NPatrick McHardy <kaber@trash.net>
      cbd88c8e
    • S
      x86 mmiotrace: fix remove_kmmio_fault_pages() · d0fc63f7
      Stuart Bennett 提交于
      Impact: fix race+crash in mmiotrace
      
      The list manipulation in remove_kmmio_fault_pages() was broken. If more
      than one consecutive kmmio_fault_page was re-added during the grace
      period between unregister_kmmio_probe() and remove_kmmio_fault_pages(),
      the list manipulation failed to remove pages from the release list.
      
      After a second grace period the pages get into rcu_free_kmmio_fault_pages()
      and raise a BUG_ON() kernel crash.
      
      The list manipulation is fixed to properly remove pages from the release
      list.
      
      This bug has been present from the very beginning of mmiotrace in the
      mainline kernel. It was introduced in 0fd0e3da ("x86: mmiotrace full
      patch, preview 1");
      
      An urgent fix for Linus. Tested by Stuart (on 32-bit) and Pekka
      (on amd and intel 64-bit systems, nouveau and nvidia proprietary).
      Signed-off-by: NStuart Bennett <stuart@freedesktop.org>
      Signed-off-by: NPekka Paalanen <pq@iki.fi>
      LKML-Reference: <20090308202135.34933feb@daedalus.pq.iki.fi>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      d0fc63f7
  6. 06 3月, 2009 2 次提交
  7. 05 3月, 2009 4 次提交
    • L
      x86: add Dell XPS710 reboot quirk · dd4124a8
      Leann Ogasawara 提交于
      Dell XPS710 will hang on reboot.  This is resolved by adding a quirk to
      set bios reboot.
      Signed-off-by: NLeann Ogasawara <leann.ogasawara@canonical.com>
      Signed-off-by: NTim Gardner <tim.gardner@canonical.com>
      Cc: "manoj.iyer" <manoj.iyer@canonical.com>
      Cc: <stable@kernel.org>
      LKML-Reference: <1236196380.3231.89.camel@emiko>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      dd4124a8
    • D
      x86, math-emu: fix init_fpu for task != current · ab9e1858
      Daniel Glöckner 提交于
      Impact: fix math-emu related crash while using GDB/ptrace
      
      init_fpu() calls finit to initialize a task's xstate, while finit always
      works on the current task. If we use PTRACE_GETFPREGS on another
      process and both processes did not already use floating point, we get
      a null pointer exception in finit.
      
      This patch creates a new function finit_task that takes a task_struct
      parameter. finit becomes a wrapper that simply calls finit_task with
      current. On the plus side this avoids many calls to get_current which
      would each resolve to an inline assembler mov instruction.
      
      An empty finit_task has been added to i387.h to avoid linker errors in
      case the compiler still emits the call in init_fpu when
      CONFIG_MATH_EMULATION is not defined.
      
      The declaration of finit in i387.h has been removed as the remaining
      code using this function gets its prototype from fpu_proto.h.
      Signed-off-by: NDaniel Glöckner <dg@emlix.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: "Pallipadi Venkatesh" <venkatesh.pallipadi@intel.com>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Bill Metzenthen <billm@melbpc.org.au>
      LKML-Reference: <E1Lew31-0004il-Fg@mailer.emlix.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ab9e1858
    • H
      x86: EFI: Back efi_ioremap with init_memory_mapping instead of FIX_MAP · dd39ecf5
      Huang Ying 提交于
      Impact: Fix boot failure on EFI system with large runtime memory range
      
      Brian Maly reported that some EFI system with large runtime memory
      range can not boot. Because the FIX_MAP used to map runtime memory
      range is smaller than run time memory range.
      
      This patch fixes this issue by re-implement efi_ioremap() with
      init_memory_mapping().
      Reported-and-tested-by: NBrian Maly <bmaly@redhat.com>
      Signed-off-by: NHuang Ying <ying.huang@intel.com>
      Cc: Brian Maly <bmaly@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <1236135513.6204.306.camel@yhuang-dev.sh.intel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      dd39ecf5
    • B
      x86: fix DMI on EFI · ff0c0874
      Brian Maly 提交于
      Impact: reactivate DMI quirks on EFI hardware
      
      DMI tables are loaded by EFI, so the dmi calls must happen after
      efi_init() and not before.
      
      Currently Apple hardware uses DMI to determine the framebuffer mappings
      for efifb. Without DMI working you also have no video on MacBook Pro.
      
      This patch resolves the DMI issue for EFI hardware (DMI is now properly
      detected at boot), and additionally efifb now loads on Apple hardware
      (i.e. video works).
      Signed-off-by: NBrian Maly <bmaly@redhat>
      Acked-by: NYinghai Lu <yinghai@kernel.org>
      Cc: ying.huang@intel.com
      LKML-Reference: <49ADEDA3.1030406@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      
       arch/x86/kernel/setup.c |    5 +++--
       1 file changed, 3 insertions(+), 2 deletions(-)
      ff0c0874
  8. 03 3月, 2009 4 次提交
    • T
      x86: oprofile: don't set counter width from cpuid on Core2 · 780eef94
      Tim Blechmann 提交于
      Impact: fix stuck NMIs and non-working oprofile on certain CPUs
      
      Resetting the counter width of the performance counters on Intel's
      Core2 CPUs, breaks the delivery of NMIs, when running in x86_64 mode.
      
      This should fix bug #12395:
      
        http://bugzilla.kernel.org/show_bug.cgi?id=12395Signed-off-by: NTim Blechmann <tim@klingt.org>
      Signed-off-by: NRobert Richter <robert.richter@amd.com>
      LKML-Reference: <20090303100412.GC10085@erda.amd.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      780eef94
    • Y
      x86: fix init_memory_mapping() to handle small ranges · 0fc59d3a
      Yinghai Lu 提交于
      Impact: fix failed EFI bootup in certain circumstances
      
      Ying Huang found init_memory_mapping() has problem with small ranges
      less than 2M when he tried to direct map the EFI runtime code out of
      max_low_pfn_mapped.
      
      It turns out we never considered that case and didn't check the range...
      Reported-by: NYing Huang <ying.huang@intel.com>
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Cc: Brian Maly <bmaly@redhat.com>
      LKML-Reference: <49ACDDED.1060508@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0fc59d3a
    • R
      x86-64: seccomp: fix 32/64 syscall hole · 5b101740
      Roland McGrath 提交于
      On x86-64, a 32-bit process (TIF_IA32) can switch to 64-bit mode with
      ljmp, and then use the "syscall" instruction to make a 64-bit system
      call.  A 64-bit process make a 32-bit system call with int $0x80.
      
      In both these cases under CONFIG_SECCOMP=y, secure_computing() will use
      the wrong system call number table.  The fix is simple: test TS_COMPAT
      instead of TIF_IA32.  Here is an example exploit:
      
      	/* test case for seccomp circumvention on x86-64
      
      	   There are two failure modes: compile with -m64 or compile with -m32.
      
      	   The -m64 case is the worst one, because it does "chmod 777 ." (could
      	   be any chmod call).  The -m32 case demonstrates it was able to do
      	   stat(), which can glean information but not harm anything directly.
      
      	   A buggy kernel will let the test do something, print, and exit 1; a
      	   fixed kernel will make it exit with SIGKILL before it does anything.
      	*/
      
      	#define _GNU_SOURCE
      	#include <assert.h>
      	#include <inttypes.h>
      	#include <stdio.h>
      	#include <linux/prctl.h>
      	#include <sys/stat.h>
      	#include <unistd.h>
      	#include <asm/unistd.h>
      
      	int
      	main (int argc, char **argv)
      	{
      	  char buf[100];
      	  static const char dot[] = ".";
      	  long ret;
      	  unsigned st[24];
      
      	  if (prctl (PR_SET_SECCOMP, 1, 0, 0, 0) != 0)
      	    perror ("prctl(PR_SET_SECCOMP) -- not compiled into kernel?");
      
      	#ifdef __x86_64__
      	  assert ((uintptr_t) dot < (1UL << 32));
      	  asm ("int $0x80 # %0 <- %1(%2 %3)"
      	       : "=a" (ret) : "0" (15), "b" (dot), "c" (0777));
      	  ret = snprintf (buf, sizeof buf,
      			  "result %ld (check mode on .!)\n", ret);
      	#elif defined __i386__
      	  asm (".code32\n"
      	       "pushl %%cs\n"
      	       "pushl $2f\n"
      	       "ljmpl $0x33, $1f\n"
      	       ".code64\n"
      	       "1: syscall # %0 <- %1(%2 %3)\n"
      	       "lretl\n"
      	       ".code32\n"
      	       "2:"
      	       : "=a" (ret) : "0" (4), "D" (dot), "S" (&st));
      	  if (ret == 0)
      	    ret = snprintf (buf, sizeof buf,
      			    "stat . -> st_uid=%u\n", st[7]);
      	  else
      	    ret = snprintf (buf, sizeof buf, "result %ld\n", ret);
      	#else
      	# error "not this one"
      	#endif
      
      	  write (1, buf, ret);
      
      	  syscall (__NR_exit, 1);
      	  return 2;
      	}
      Signed-off-by: NRoland McGrath <roland@redhat.com>
      [ I don't know if anybody actually uses seccomp, but it's enabled in
        at least both Fedora and SuSE kernels, so maybe somebody is. - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b101740
    • R
      x86-64: syscall-audit: fix 32/64 syscall hole · ccbe495c
      Roland McGrath 提交于
      On x86-64, a 32-bit process (TIF_IA32) can switch to 64-bit mode with
      ljmp, and then use the "syscall" instruction to make a 64-bit system
      call.  A 64-bit process make a 32-bit system call with int $0x80.
      
      In both these cases, audit_syscall_entry() will use the wrong system
      call number table and the wrong system call argument registers.  This
      could be used to circumvent a syscall audit configuration that filters
      based on the syscall numbers or argument details.
      Signed-off-by: NRoland McGrath <roland@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ccbe495c
  9. 02 3月, 2009 7 次提交
    • P
      x86 mmiotrace: fix race with release_kmmio_fault_page() · 340430c5
      Pekka Paalanen 提交于
      There was a theoretical possibility to a race between arming a page in
      post_kmmio_handler() and disarming the page in
      release_kmmio_fault_page():
      
      cpu0                             cpu1
      ------------------------------------------------------------------
      mmiotrace shutdown
      enter release_kmmio_fault_page
                                       fault on the page
                                       disarm the page
      disarm the page
                                       handle the MMIO access
                                       re-arm the page
      put the page on release list
      remove_kmmio_fault_pages()
                                       fault on the page
                                       page not known to mmiotrace
                                       fall back to do_page_fault()
                                       *KABOOM*
      
      (This scenario also shows the double disarm case which is allowed.)
      
      Fixed by acquiring kmmio_lock in post_kmmio_handler() and checking
      if the page is being released from mmiotrace.
      Signed-off-by: NPekka Paalanen <pq@iki.fi>
      Cc: Stuart Bennett <stuart@freedesktop.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      340430c5
    • S
      x86 mmiotrace: improve handling of secondary faults · 3e39aa15
      Stuart Bennett 提交于
      Upgrade some kmmio.c debug messages to warnings.
      Allow secondary faults on probed pages to fall through, and only log
      secondary faults that are not due to non-present pages.
      
      Patch edited by Pekka Paalanen.
      Signed-off-by: NStuart Bennett <stuart@freedesktop.org>
      Signed-off-by: NPekka Paalanen <pq@iki.fi>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      3e39aa15
    • P
      x86 mmiotrace: split set_page_presence() · 0b700a6a
      Pekka Paalanen 提交于
      From 36772dcb6ffbbb68254cbfc379a103acd2fbfefc Mon Sep 17 00:00:00 2001
      From: Pekka Paalanen <pq@iki.fi>
      Date: Sat, 28 Feb 2009 21:34:59 +0200
      
      Split set_page_presence() in kmmio.c into two more functions set_pmd_presence()
      and set_pte_presence(). Purely code reorganization, no functional changes.
      Signed-off-by: NPekka Paalanen <pq@iki.fi>
      Cc: Stuart Bennett <stuart@freedesktop.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0b700a6a
    • P
      x86 mmiotrace: fix save/restore page table state · 5359b585
      Pekka Paalanen 提交于
      From baa99e2b32449ec7bf147c234adfa444caecac8a Mon Sep 17 00:00:00 2001
      From: Pekka Paalanen <pq@iki.fi>
      Date: Sun, 22 Feb 2009 20:02:43 +0200
      
      Blindly setting _PAGE_PRESENT in disarm_kmmio_fault_page() overlooks the
      possibility, that the page was not present when it was armed.
      
      Make arm_kmmio_fault_page() store the previous page presence in struct
      kmmio_fault_page and use it on disarm.
      
      This patch was originally written by Stuart Bennett, but Pekka Paalanen
      rewrote it a little different.
      Signed-off-by: NPekka Paalanen <pq@iki.fi>
      Cc: Stuart Bennett <stuart@freedesktop.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      5359b585
    • S
      x86 mmiotrace: WARN_ONCE if dis/arming a page fails · e9d54cae
      Stuart Bennett 提交于
      Print a full warning once, if arming or disarming a page fails.
      
      Also, if initial arming fails, do not handle the page further. This
      avoids the possibility of a page failing to arm and then later claiming
      to have handled any fault on that page.
      
      WARN_ONCE added by Pekka Paalanen.
      Signed-off-by: NStuart Bennett <stuart@freedesktop.org>
      Signed-off-by: NPekka Paalanen <pq@iki.fi>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e9d54cae
    • P
      x86: add far read test to testmmiotrace · 5ff93697
      Pekka Paalanen 提交于
      Apparently pages far into an ioremapped region might not actually be
      mapped during ioremap(). Add an optional read test to try to trigger a
      multiply faulting MMIO access. Also add more messages to the kernel log
      to help debugging.
      
      This patch is based on a patch suggested by
      Stuart Bennett <stuart@freedesktop.org>
      who discovered bugs in mmiotrace related to normal kernel space faults.
      Signed-off-by: NPekka Paalanen <pq@iki.fi>
      Cc: Stuart Bennett <stuart@freedesktop.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      5ff93697
    • P
      x86: count errors in testmmiotrace.ko · fab852aa
      Pekka Paalanen 提交于
      Check the read values against the written values in the MMIO read/write
      test. This test shows if the given MMIO test area really works as
      memory, which is a prerequisite for a successful mmiotrace test.
      Signed-off-by: NPekka Paalanen <pq@iki.fi>
      Cc: Stuart Bennett <stuart@freedesktop.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      fab852aa
  10. 28 2月, 2009 1 次提交
  11. 27 2月, 2009 1 次提交
  12. 26 2月, 2009 1 次提交
  13. 25 2月, 2009 2 次提交
  14. 23 2月, 2009 2 次提交
  15. 22 2月, 2009 2 次提交
  16. 21 2月, 2009 1 次提交
    • H
      x86, mce: remove incorrect __cpuinit for mce_cpu_features() · cc3ca220
      H. Peter Anvin 提交于
      Impact: Bug fix on UP
      
      Checkin 6ec68bff:
          x86, mce: reinitialize per cpu features on resume
      
      introduced a call to mce_cpu_features() in the resume path, in order
      for the MCE machinery to get properly reinitialized after a resume.
      However, this function (and its successors) was flagged __cpuinit,
      which becomes __init on UP configurations (on SMP suspend/resume
      requires CPU hotplug and so this would not be seen.)
      
      Remove the offending __cpuinit annotations for mce_cpu_features() and
      its successor functions.
      
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      cc3ca220
  17. 20 2月, 2009 2 次提交
    • I
      x86: use the right protections for split-up pagetables · 07a66d7c
      Ingo Molnar 提交于
      Steven Rostedt found a bug in where in his modified kernel
      ftrace was unable to modify the kernel text, due to the PMD
      itself having been marked read-only as well in
      split_large_page().
      
      The fix, suggested by Linus, is to not try to 'clone' the
      reference protection of a huge-page, but to use the standard
      (and permissive) page protection bits of KERNPG_TABLE.
      
      The 'cloning' makes sense for the ptes but it's a confused and
      incorrect concept at the page table level - because the
      pagetable entry is a set of all ptes and hence cannot
      'clone' any single protection attribute - the ptes can be any
      mixture of protections.
      
      With the permissive KERNPG_TABLE, even if the pte protections
      get changed after this point (due to ftrace doing code-patching
      or other similar activities like kprobes), the resulting combined
      protections will still be correct and the pte's restrictive
      (or permissive) protections will control it.
      
      Also update the comment.
      
      This bug was there for a long time but has not caused visible
      problems before as it needs a rather large read-only area to
      trigger. Steve possibly hacked his kernel with some really
      large arrays or so. Anyway, the bug is definitely worth fixing.
      
      [ Huang Ying also experienced problems in this area when writing
        the EFI code, but the real bug in split_large_page() was not
        realized back then. ]
      Reported-by: NSteven Rostedt <rostedt@goodmis.org>
      Reported-by: NHuang Ying <ying.huang@intel.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      07a66d7c
    • A
      x86, vmi: TSC going backwards check in vmi clocksource · 48ffc70b
      Alok N Kataria 提交于
      Impact: fix time warps under vmware
      
      Similar to the check for TSC going backwards in the TSC clocksource,
      we also need this check for VMI clocksource.
      Signed-off-by: NAlok N Kataria <akataria@vmware.com>
      Cc: Zachary Amsden <zach@vmware.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Cc: stable@kernel.org
      48ffc70b
  18. 19 2月, 2009 1 次提交
    • K
      mm: clean up for early_pfn_to_nid() · f2dbcfa7
      KAMEZAWA Hiroyuki 提交于
      What's happening is that the assertion in mm/page_alloc.c:move_freepages()
      is triggering:
      
      	BUG_ON(page_zone(start_page) != page_zone(end_page));
      
      Once I knew this is what was happening, I added some annotations:
      
      	if (unlikely(page_zone(start_page) != page_zone(end_page))) {
      		printk(KERN_ERR "move_freepages: Bogus zones: "
      		       "start_page[%p] end_page[%p] zone[%p]\n",
      		       start_page, end_page, zone);
      		printk(KERN_ERR "move_freepages: "
      		       "start_zone[%p] end_zone[%p]\n",
      		       page_zone(start_page), page_zone(end_page));
      		printk(KERN_ERR "move_freepages: "
      		       "start_pfn[0x%lx] end_pfn[0x%lx]\n",
      		       page_to_pfn(start_page), page_to_pfn(end_page));
      		printk(KERN_ERR "move_freepages: "
      		       "start_nid[%d] end_nid[%d]\n",
      		       page_to_nid(start_page), page_to_nid(end_page));
       ...
      
      And here's what I got:
      
      	move_freepages: Bogus zones: start_page[2207d0000] end_page[2207dffc0] zone[fffff8103effcb00]
      	move_freepages: start_zone[fffff8103effcb00] end_zone[fffff8003fffeb00]
      	move_freepages: start_pfn[0x81f600] end_pfn[0x81f7ff]
      	move_freepages: start_nid[1] end_nid[0]
      
      My memory layout on this box is:
      
      [    0.000000] Zone PFN ranges:
      [    0.000000]   Normal   0x00000000 -> 0x0081ff5d
      [    0.000000] Movable zone start PFN for each node
      [    0.000000] early_node_map[8] active PFN ranges
      [    0.000000]     0: 0x00000000 -> 0x00020000
      [    0.000000]     1: 0x00800000 -> 0x0081f7ff
      [    0.000000]     1: 0x0081f800 -> 0x0081fe50
      [    0.000000]     1: 0x0081fed1 -> 0x0081fed8
      [    0.000000]     1: 0x0081feda -> 0x0081fedb
      [    0.000000]     1: 0x0081fedd -> 0x0081fee5
      [    0.000000]     1: 0x0081fee7 -> 0x0081ff51
      [    0.000000]     1: 0x0081ff59 -> 0x0081ff5d
      
      So it's a block move in that 0x81f600-->0x81f7ff region which triggers
      the problem.
      
      This patch:
      
      Declaration of early_pfn_to_nid() is scattered over per-arch include
      files, and it seems it's complicated to know when the declaration is used.
       I think it makes fix-for-memmap-init not easy.
      
      This patch moves all declaration to include/linux/mm.h
      
      After this,
        if !CONFIG_NODES_POPULATES_NODE_MAP && !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
           -> Use static definition in include/linux/mm.h
        else if !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
           -> Use generic definition in mm/page_alloc.c
        else
           -> per-arch back end function will be called.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Tested-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reported-by: NDavid Miller <davem@davemlloft.net>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f2dbcfa7
  19. 18 2月, 2009 2 次提交