1. 29 Apr, 2011 1 commit
  2. 27 Apr, 2011 1 commit
  3. 22 Apr, 2011 4 commits
    • perf, x86: Update/fix Intel Nehalem cache events · f4929bd3
      Authored by Peter Zijlstra
      Change the Nehalem cache events to use retired memory instruction counters
      (similar to Westmere); this greatly improves the provided stats.
      
      Using:
      
      int main(void)
      {
              int i;
      
              /* 1 billion iterations, each doing one load and one store */
              for (i = 0; i < 1000000000; i++) {
                      asm("mov (%%rsp), %%rbx;"
                          "mov %%rbx, (%%rsp);" : : : "rbx");
              }
              return 0;
      }
      
      We find:
      
       $ perf stat --repeat 10 -e instructions:u -e l1-dcache-loads:u -e l1-dcache-stores:u ./loop_1b_loads+stores
        Performance counter stats for './loop_1b_loads+stores' (10 runs):
            4,000,081,056 instructions:u           #      0.000 IPC ( +-   0.000% )
            4,999,502,846 l1-dcache-loads:u          ( +-   0.008% )
            1,000,034,832 l1-dcache-stores:u         ( +-   0.000% )
               1.565184942  seconds time elapsed   ( +-   0.005% )
      
      The 5 billion L1 data-cache loads are surprising; we'd expect 1 billion,
      which is what the raw event r10b reports:
      
       $ perf stat --repeat 10 -e instructions:u -e r10b:u -e l1-dcache-stores:u ./loop_1b_loads+stores
        Performance counter stats for './loop_1b_loads+stores' (10 runs):
            4,000,081,054 instructions:u           #      0.000 IPC ( +-   0.000% )
            1,000,021,961 r10b:u                     ( +-   0.000% )
            1,000,030,951 l1-dcache-stores:u         ( +-   0.000% )
               1.565055422  seconds time elapsed   ( +-   0.003% )
      
      Which this patch thus fixes.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Lin Ming <ming.m.lin@intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Link: http://lkml.kernel.org/n/tip-q9rtru7b7840tws75xzboapv@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
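
The raw event `r10b` in the Nehalem commit above packs two fields into one hex number. A minimal sketch of how such a code splits, assuming the usual Intel layout (event select in bits 0-7, unit mask in bits 8-15); `struct raw_event` and `decode_raw_event` are hypothetical names used only for illustration, not perf or kernel APIs:

```c
#include <stdint.h>

/* Fields of an x86 raw event code, i.e. the hex digits after the "r". */
struct raw_event {
    uint8_t event_select; /* bits 0-7 of the config value */
    uint8_t unit_mask;    /* bits 8-15 of the config value */
};

/* Split a raw config value such as 0x10b into its two fields. */
static struct raw_event decode_raw_event(uint64_t config)
{
    struct raw_event ev;

    ev.event_select = (uint8_t)(config & 0xff);
    ev.unit_mask = (uint8_t)((config >> 8) & 0xff);
    return ev;
}
```

With this layout, `decode_raw_event(0x10b)` yields event select 0x0b with unit mask 0x01, which is the retired-memory-instruction style of counter the commit message switches to.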
    • perf, x86: P4 PMU - Don't forget to clear cpuc->active_mask on overflow · 1ea5a6af
      Authored by Cyrill Gorcunov
      It's not enough to simply disable the event on overflow:
      cpuc->active_mask should be cleared as well, otherwise the counter
      may stay marked "active" even though it is in fact already disabled
      (which may leave the user unable to use this counter any
      further).
      
      Don pointed out that:
      
       " I also noticed this patch fixed some unknown NMIs
         on a P4 when I stressed the box".
      Tested-by: Lin Ming <ming.m.lin@intel.com>
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Don Zickus <dzickus@redhat.com>
      Signed-off-by: Don Zickus <dzickus@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Link: http://lkml.kernel.org/r/1303398203-2918-3-git-send-email-dzickus@redhat.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
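
The bookkeeping problem described in the P4 commit above can be pictured with a toy model. This is only a sketch under stated assumptions: `struct toy_cpuc`, `toy_stop_on_overflow` and `toy_counter_is_free` are hypothetical stand-ins, not the kernel's `struct cpu_hw_events` or the P4 PMU code. The point is that disabling the counter and clearing its bit in the active mask have to happen together, or the counter keeps looking busy.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for per-CPU PMU state (hypothetical, not the kernel struct). */
struct toy_cpuc {
    uint64_t active_mask; /* bit n set: counter n is considered in use */
    uint64_t enabled_hw;  /* bit n set: counter n is enabled in "hardware" */
};

/* On overflow, disable the counter and also clear its bookkeeping bit. */
static void toy_stop_on_overflow(struct toy_cpuc *cpuc, unsigned int idx)
{
    cpuc->enabled_hw &= ~(UINT64_C(1) << idx);
    cpuc->active_mask &= ~(UINT64_C(1) << idx);
}

/* A counter can be handed out again only if its active bit is clear. */
static bool toy_counter_is_free(const struct toy_cpuc *cpuc, unsigned int idx)
{
    return (cpuc->active_mask & (UINT64_C(1) << idx)) == 0;
}
```

If `toy_stop_on_overflow` cleared only `enabled_hw`, `toy_counter_is_free` would keep returning false for a counter that is no longer running, which is the stuck state the commit message describes.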
    • x86, perf event: Turn off unstructured raw event access to offcore registers · b52c55c6
      Authored by Ingo Molnar
      Andi Kleen pointed out that the Intel offcore support patches were merged
      without user-space tool support for the functionality:
      
       |
       | The offcore_msr perf kernel code was merged into 2.6.39-rc*, but the
       | user space bits were not. This made it impossible to set the extra mask
       | and actually do the OFFCORE profiling
       |
      
      Andi submitted a preliminary patch for user-space support, as an
      extension to perf's raw event syntax:
      
       |
       | Some raw events -- like the Intel OFFCORE events -- support additional
       | parameters. These can be appended after a ':'.
       |
       | For example on a multi socket Intel Nehalem:
       |
       |    perf stat -e r1b7:20ff -a sleep 1
       |
       | Profile the OFFCORE_RESPONSE.ANY_REQUEST with event mask REMOTE_DRAM_0
       | that measures any access to DRAM on another socket.
       |
      
      But this kind of usability is absolutely unacceptable - users should not
      be expected to type in magic, CPU and model specific incantations to get
      access to useful hardware functionality.
      
      The proper solution is to expose useful offcore functionality via
      generalized events - that way users do not have to care which specific
      CPU model they are using, they can use the conceptual event and not some
      model specific quirky hex number.
      
      We already have such generalization in place for CPU cache events,
      and it's all very extensible.
      
      "Offcore" events measure general DRAM access patterns along various
      parameters. They are particularly useful in NUMA systems.
      
      We want to support them via generalized DRAM events: either as the
      fourth level of cache (after the last-level cache), or as a separate
      generalization category.
      
      That way user-space support would be very obvious, memory access
      profiling could be done via self-explanatory commands like:
      
        perf record -e dram ./myapp
        perf record -e dram-remote ./myapp
      
      ... to measure DRAM accesses or more expensive cross-node NUMA DRAM
      accesses.
      
      These generalized events would work on all CPUs and architectures that
      have comparable PMU features.
      
      ( Note, these are just examples: actual implementation could have more
        sophistication and more parameters - as long as they center around
        similarly simple usecases. )
      
      Now we do not want to revert *all* of the current offcore bits, as they
      are still somewhat useful for generic last-level-cache events, implemented
      in this commit:
      
        e994d7d2: perf: Fix LLC-* events on Intel Nehalem/Westmere
      
      But we definitely do not yet want to expose the unstructured raw events
      to user-space, until better generalization and usability is implemented
      for these hardware event features.
      
      ( Note: after generalization has been implemented raw offcore events can be
        supported as well: there can always be an odd event that is marginally
        useful but not useful enough to generalize. DRAM profiling is definitely
        *not* such a category so generalization must be done first. )
      
      Furthermore, PERF_TYPE_RAW access to these registers was not intended
      to go upstream without proper support - it was a side-effect of the above
      e994d7d2 commit, not mentioned in the changelog.
      
      As v2.6.39 is nearing release we go for the simplest approach: disable
      the PERF_TYPE_RAW offcore hack for now, before it escapes into a released
      kernel and becomes an ABI.
      
      Once proper structure is implemented for these hardware events and users
      are offered usable solutions we can revisit this issue.
      Reported-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1302658203-4239-1-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf: Support Xeon E7's via the Westmere PMU driver · b2508e82
      Authored by Andi Kleen
      There's a new model number public, 47, for Xeon E7 (aka Westmere EX).
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: a.p.zijlstra@chello.nl
      Link: http://lkml.kernel.org/r/1303429715-10202-1-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  4. 21 Apr, 2011 4 commits
    • [PARISC] set memory ranges in N_NORMAL_MEMORY when onlined · d9b41e0b
      Authored by David Rientjes
      When a DISCONTIGMEM memory range is brought online as a NUMA node, it
      also needs to have its bit set in N_NORMAL_MEMORY.  This is necessary for
      generic kernel code that utilizes N_NORMAL_MEMORY as a subset of N_ONLINE
      for memory savings.
      
      These types of hacks can hopefully be removed once DISCONTIGMEM is either
      removed or abstracted away from CONFIG_NUMA.
      
      Fixes a panic in the slub code which only initializes structures for
      N_NORMAL_MEMORY to save memory:
      
      	Backtrace:
      	 [<000000004021c938>] add_partial+0x28/0x98
      	 [<000000004021faa0>] __slab_free+0x1d0/0x1d8
      	 [<000000004021fd04>] kmem_cache_free+0xc4/0x128
      	 [<000000004033bf9c>] ida_get_new_above+0x21c/0x2c0
      	 [<00000000402a8980>] sysfs_new_dirent+0xd0/0x238
      	 [<00000000402a974c>] create_dir+0x5c/0x168
      	 [<00000000402a9ab0>] sysfs_create_dir+0x98/0x128
      	 [<000000004033d6c4>] kobject_add_internal+0x114/0x258
      	 [<000000004033d9ac>] kobject_add_varg+0x7c/0xa0
      	 [<000000004033df20>] kobject_add+0x50/0x90
      	 [<000000004033dfb4>] kobject_create_and_add+0x54/0xc8
      	 [<00000000407862a0>] cgroup_init+0x138/0x1f0
      	 [<000000004077ce50>] start_kernel+0x5a0/0x840
      	 [<000000004011fa3c>] start_parisc+0xa4/0xb8
      	 [<00000000404bb034>] packet_ioctl+0x16c/0x208
      	 [<000000004049ac30>] ip_mroute_setsockopt+0x260/0xf20
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: stable@kernel.org
      Signed-off-by: James Bottomley <James.Bottomley@suse.de>
    • x86, numa: Fix cpu nodemasks for NUMA emulation and CONFIG_DEBUG_PER_CPU_MAPS · 7a6c6547
      Authored by David Rientjes
      The cpu<->node mappings under CONFIG_DEBUG_PER_CPU_MAPS=y are
      currently broken when NUMA emulation is enabled, because the code
      does not iterate through every emulated node and bind the cpus that
      have affinity to it.
      
      NUMA emulation should bind each cpu to every local node to
      accurately represent the true NUMA topology of the underlying
      machine.
      
      debug_cpumask_set_cpu() needs to be fixed at the same time so
      that the debugging information that it emits shows the new
      cpumask of the node being assigned when the cpu is being added
      or removed.
      
      It can now take responsibility for setting or clearing the cpu
      itself, which removes the need for duplicate code.
      
      Also change its last parameter, "enable", to have the correct bool
      type since it can only be true or false.
      
       -v2: Fix the return statements, by Kosaki Motohiro
      Acked-and-Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andreas Herrmann <herrmann.der.user@googlemail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1104201918470.12634@chino.kir.corp.google.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
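
The helper shape described in the commit above (one function that sets or clears a cpu in a node's mask, driven by a bool "enable" parameter) might look roughly like the following toy version. It is a sketch under assumptions: `toy_node_to_cpumask` and `toy_cpumask_set_cpu` are hypothetical names, the masks are plain 64-bit integers, and the bool return value is added here only for illustration; none of this is the kernel's `debug_cpumask_set_cpu()`.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_NODES 4

/* One 64-bit cpu mask per (emulated) node; hypothetical toy storage. */
static uint64_t toy_node_to_cpumask[TOY_MAX_NODES];

/* Set or clear `cpu` in the mask of `node`; returns false on bad arguments. */
static bool toy_cpumask_set_cpu(unsigned int cpu, unsigned int node, bool enable)
{
    if (node >= TOY_MAX_NODES || cpu >= 64)
        return false;

    if (enable)
        toy_node_to_cpumask[node] |= UINT64_C(1) << cpu;
    else
        toy_node_to_cpumask[node] &= ~(UINT64_C(1) << cpu);
    return true;
}
```

Binding one cpu to every emulated node it has affinity to is then a loop over those nodes calling the helper with `enable` set to true, and removal is the same loop with false.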
    • Revert "x86, NUMA: Fix fakenuma boot failure" · 37f8527d
      Authored by David Rientjes
      Andreas Herrmann reported that 7d6b4670 ("x86, NUMA: Fix fakenuma
      boot failure") causes certain physical NUMA topologies (for example
      AMD Magny-Cours) to move sibling cpus to a single node when in reality
      they are in separate domains.
      
      This may result in some nodes being completely void of cpus, which
      doesn't accurately represent the correct topology. The system will
      boot, but will have suboptimal NUMA performance.
      
      This commit was intended as a fix for NUMA emulation, but should
      not cause a regression for real NUMA machines as a side effect.
      
      ( There will be a separate fix for the numa-debug code, which
        will not affect physical topologies. )
      Reported-by: Andreas Herrmann <herrmann.der.user@googlemail.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1104201918110.12634@chino.kir.corp.google.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • mach-ux500: fix i2c0 device setup regression · cf568c58
      Authored by Linus Walleij
      Adding two sets of I2C devices to the same bus doesn't quite work,
      at least not anymore. Stash one array and determine how much of it
      shall be added instead.
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
  5. 20 Apr, 2011 8 commits
    • xen: mask_rw_pte: do not apply the early_ioremap checks on x86_32 · ee176455
      Authored by Stefano Stabellini
      The two "is_early_ioremap_ptep" checks in mask_rw_pte are only used on
      x86_64; in fact early_ioremap is not used at all to set up the initial
      pagetable on x86_32.
      Moreover, on x86_32 the two checks are wrong because the range
      pgt_buf_start..pgt_buf_end should initially be mapped RW, since
      the pages in the range are not pagetable pages yet and haven't been
      cleared yet. Afterwards, since pgt_buf_start..pgt_buf_end is
      part of the initial mapping, xen_alloc_pte is capable of turning
      the ptes RO when they become pagetable pages.
      
      Fix the issue and improve the readability of the code by providing two
      different implementations of mask_rw_pte for x86_32 and x86_64.
      Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • xen: do not create the extra e820 region at an addr lower than 4G · 24bdb0b6
      Authored by Stefano Stabellini
      Do not add the extra e820 region at a physical address lower than 4G
      because it breaks e820_end_of_low_ram_pfn().
      
      It is OK for us to move the xen_extra_mem_start up and down because this
      is the index of the memory that can be ballooned in/out - it is memory
      not available to the kernel during bootup.
      Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
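
The rule in the Xen e820 commit above can be read as a clamp on the start address of the extra region. A minimal sketch, assuming the fix amounts to raising the start to at least 4 GiB; `clamp_extra_mem_start` is a hypothetical helper, not the actual Xen setup code:

```c
#include <stdint.h>

#define FOUR_GIB (UINT64_C(1) << 32)

/* Never let the extra (balloonable) region start below the 4 GiB boundary. */
static uint64_t clamp_extra_mem_start(uint64_t extra_mem_start)
{
    return extra_mem_start < FOUR_GIB ? FOUR_GIB : extra_mem_start;
}
```

A start address already above 4 GiB is returned unchanged, which matches the commit message's point that moving the start up is harmless because the region is not available to the kernel during bootup anyway.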
    • [S390] kvm-390: Let kernel exit SIE instruction on work · 9ff4cfb3
      Authored by Carsten Otte
      From: Christian Borntraeger <borntraeger@de.ibm.com>
      
      This patch fixes the sie exit on interrupts. The low level
      interrupt handler returns to the PSW address in pt_regs and not
      to the PSW address in the lowcore.
      Without this fix a cpu bound guest might never leave guest state
      since the host interrupt handler would blindly return to the
      SIE instruction, even on need_resched and friends.
      
      Cc: stable@kernel.org
      Signed-off-by: Carsten Otte <cotte@de.ibm.com>
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • [S390] pfault: fix token handling · e35c76cd
      Authored by Heiko Carstens
      f6649a7e "[S390] cleanup lowcore access from external interrupts" changed
      the handling of external interrupts. Instead of letting the external
      interrupt handlers access the per cpu lowcore, the entry code of the kernel
      already reads all fields that are necessary and passes them to the handlers.
      The pfault interrupt handler was incorrectly converted. It tries to
      dereference a value which used to be a pointer to a lowcore field. After
      the conversion, however, it is no longer the pointer to the field but its
      content. So instead of a dereference only a cast is needed to get the
      task pointer that caused the pfault.
      
      Fixes a NULL pointer dereference and a subsequent kernel crash:
      
      Unable to handle kernel pointer dereference at virtual kernel address (null)
      Oops: 0004 [#1] SMP
      Modules linked in: nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc
                         loop qeth_l3 qeth vmur ccwgroup ext3 jbd mbcache dm_mod
                         dasd_eckd_mod dasd_diag_mod dasd_mod
      CPU: 0 Not tainted 2.6.38-2-s390x #1
      Process cron (pid: 1106, task: 000000001f962f78, ksp: 000000001fa0f9d0)
      Krnl PSW : 0404200180000000 000000000002c03e (pfault_interrupt+0xa2/0x138)
                 R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
      Krnl GPRS: 0000000000000000 0000000000000001 0000000000000000 0000000000000001
                 000000001f962f78 0000000000518968 0000000090000002 000000001ff03280
                 0000000000000000 000000000064f000 000000001f962f78 0000000000002603
                 0000000006002603 0000000000000000 000000001ff7fe68 000000001ff7fe48
      Krnl Code: 000000000002c036: 5820d010            l       %r2,16(%r13)
                 000000000002c03a: 1832                lr      %r3,%r2
                 000000000002c03c: 1a31                ar      %r3,%r1
                >000000000002c03e: ba23d010            cs      %r2,%r3,16(%r13)
                 000000000002c042: a744fffc            brc     4,2c03a
                 000000000002c046: a7290002            lghi    %r2,2
                 000000000002c04a: e320d0000024        stg     %r2,0(%r13)
                 000000000002c050: 07f0                bcr     15,%r0
      Call Trace:
       ([<000000001f962f78>] 0x1f962f78)
        [<000000000001acda>] do_extint+0xf6/0x138
        [<000000000039b6ca>] ext_no_vtime+0x30/0x34
        [<000000007d706e04>] 0x7d706e04
      Last Breaking-Event-Address:
        [<0000000000000000>] 0x0
      
      For stable maintainers:
      the first kernel which contains this bug is 2.6.37.
      Reported-by: Stephen Powell <zlinuxman@wowway.com>
      Cc: Jonathan Nieder <jrnieder@gmail.com>
      Cc: stable@kernel.org
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
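
The dereference-versus-cast distinction in the pfault commit above is easy to get wrong, so here is a small user-space sketch of the two calling conventions. All names (`struct toy_task`, `task_from_field`, `task_from_token`) are hypothetical and only model the idea; this is not the s390 handler code.

```c
#include <stdint.h>

struct toy_task {
    int pid;
};

/* Old convention: the handler gets a pointer to the field and reads it. */
static struct toy_task *task_from_field(const uintptr_t *field)
{
    return (struct toy_task *)*field;
}

/* New convention: the handler gets the field's content, so a cast suffices. */
static struct toy_task *task_from_token(uintptr_t token)
{
    return (struct toy_task *)token;
}
```

Treating a value passed under the new convention as if it were still a pointer to the field adds one dereference too many, which is the kind of mistake that produces the oops quoted in the commit message.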
    • [S390] fix page table walk for changing page attributes · e4c031b4
      Authored by Jan Glauber
      The page table walk for changing page attributes used the wrong
      address for pgd/pud/pmd lookups if the range was bigger than
      a pmd entry. Fix the lookup by using the correct address.
      Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • [S390] prng: prevent access beyond end of stack · c708c57e
      Authored by Jan Glauber
      While initializing the state of the prng only the first 8 bytes of
      random data were used; the second 8 bytes were read from the memory
      after the stack. If only 64 bytes of the kernel stack are used and
      CONFIG_DEBUG_PAGEALLOC is enabled a kernel panic may occur because of
      the invalid page access. Use the correct multiplier to stay within
      the random data buffer.
      Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
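
The prng commit above describes an offset being scaled by the wrong factor, so that the second 8-byte word is fetched from outside a 16-byte buffer. The following is a sketch of one way to keep such reads in bounds, not the actual s390 prng code; `SEED_BYTES` and `read_seed_word` are hypothetical names. The word index is converted to a byte offset exactly once and checked against the buffer size before copying.

```c
#include <stdint.h>
#include <string.h>

#define SEED_BYTES 16

/*
 * Copy the 8-byte word with index `word` out of a 16-byte seed buffer.
 * Returns 0 on success and -1 if the word would lie outside the buffer.
 */
static int read_seed_word(const unsigned char buf[SEED_BYTES],
                          unsigned int word, uint64_t *out)
{
    size_t offset = (size_t)word * sizeof(uint64_t); /* byte offset: 0, 8, ... */

    if (offset + sizeof(uint64_t) > SEED_BYTES)
        return -1;

    memcpy(out, buf + offset, sizeof(uint64_t));
    return 0;
}
```

Scaling the index twice (for example adding a byte offset to a pointer that is already typed as 64-bit words) is the sort of slip that moves the second read past the end of the buffer.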
    • PM: Add missing syscore_suspend() and syscore_resume() calls · 19234c08
      Authored by Rafael J. Wysocki
      Device suspend/resume infrastructure is used not only by the suspend
      and hibernate code in kernel/power, but also by APM, Xen and the
      kexec jump feature.  However, commit 40dc166c
      (PM / Core: Introduce struct syscore_ops for core subsystems PM)
      failed to add syscore_suspend() and syscore_resume() calls to that
      code, which generally leads to breakage when the features in question
      are used.
      
      To fix this problem, add the missing syscore_suspend() and
      syscore_resume() calls to arch/x86/kernel/apm_32.c, kernel/kexec.c
      and drivers/xen/manage.c.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
      Acked-by: Ian Campbell <ian.campbell@citrix.com>
    • xtensa: Fixup irq conversion fallout and nmi_count · 2ea4db65
      Authored by Thomas Gleixner
      Some unnamed moron fatfingered the arguments of the irq chip callbacks
      to irq_chip instead of irq_data.
      
      While at it remove the nmi_count() print in arch_show_interrupts()
      which was already broken before the irq conversion.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  6. 19 Apr, 2011 5 commits
  7. 18 Apr, 2011 10 commits
  8. 17 Apr, 2011 1 commit
  9. 16 Apr, 2011 2 commits
    • x86, amd: Disable GartTlbWlkErr when BIOS forgets it · 5bbc097d
      Authored by Joerg Roedel
      This patch disables GartTlbWlk errors on AMD Fam10h CPUs if
      the BIOS forgets to do it (or is just too old). Leaving
      these errors enabled can cause a sync-flood on the CPU,
      causing a reboot.
      
      The AMD BKDG recommends disabling GART TLB Wlk Error completely.
      
      This patch is the fix for
      
      	https://bugzilla.kernel.org/show_bug.cgi?id=33012
      
      on my machine.
      Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
      Link: http://lkml.kernel.org/r/20110415131152.GJ18463@8bytes.org
      Tested-by: Alexandre Demers <alexandre.f.demers@gmail.com>
      Cc: <stable@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • x86, NUMA: Fix fakenuma boot failure · 7d6b4670
      Authored by KOSAKI Motohiro
      Currently, the numa=fake boot parameter is broken. If it's used,
      the kernel may panic due to a divide by zero error, depending on
      the CPU configuration:
      
      Call Trace:
       [<ffffffff8104ad4c>] find_busiest_group+0x38c/0xd30
       [<ffffffff81086aff>] ? local_clock+0x6f/0x80
       [<ffffffff81050533>] load_balance+0xa3/0x600
       [<ffffffff81050f53>] idle_balance+0xf3/0x180
       [<ffffffff81550092>] schedule+0x722/0x7d0
       [<ffffffff81550538>] ? wait_for_common+0x128/0x190
       [<ffffffff81550a65>] schedule_timeout+0x265/0x320
       [<ffffffff81095815>] ? lock_release_holdtime+0x35/0x1a0
       [<ffffffff81550538>] ? wait_for_common+0x128/0x190
       [<ffffffff8109bb6c>] ? __lock_release+0x9c/0x1d0
       [<ffffffff815534e0>] ? _raw_spin_unlock_irq+0x30/0x40
       [<ffffffff815534e0>] ? _raw_spin_unlock_irq+0x30/0x40
       [<ffffffff81550540>] wait_for_common+0x130/0x190
       [<ffffffff81051920>] ? try_to_wake_up+0x510/0x510
       [<ffffffff8155067d>] wait_for_completion+0x1d/0x20
       [<ffffffff8107f36c>] kthread_create_on_node+0xac/0x150
       [<ffffffff81077bb0>] ? process_scheduled_works+0x40/0x40
       [<ffffffff8155045f>] ? wait_for_common+0x4f/0x190
       [<ffffffff8107a283>] __alloc_workqueue_key+0x1a3/0x590
       [<ffffffff81e0cce2>] cpuset_init_smp+0x6b/0x7b
       [<ffffffff81df3d07>] kernel_init+0xc3/0x182
       [<ffffffff8155d5e4>] kernel_thread_helper+0x4/0x10
       [<ffffffff81553cd4>] ? retint_restore_args+0x13/0x13
       [<ffffffff81df3c44>] ? start_kernel+0x400/0x400
       [<ffffffff8155d5e0>] ? gs_change+0x13/0x13
      
      The divide by zero is caused by the following line, where
      group->cpu_power==0:
      
       kernel/sched_fair.c::update_sg_lb_stats()
              /* Adjust by relative CPU power of the group */
              sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;
      
      This regression was caused by commit e23bba60 ("x86-64, NUMA: Unify
      emulated distance mapping") because it changes cpu -> node
      mapping in the process of dropping fake_physnodes().
      
        old) all cpus are assigned node 0
        now) cpus are assigned round robin
             (the logic is implemented by numa_init_array())
      
        Note: The change in behavior only happens if the system has
              neither an ACPI SRAT table nor AMD northbridge NUMA
              information.
      
      Round robin assignment doesn't work because init_numa_sched_groups_power()
      assumes all logical cpus in the same physical cpu share the same node
      (then it only accounts for group_first_cpu()), and the simple round robin
      breaks the above assumption.
      
      Thus, this patch implements a reassignment of node-ids if buggy firmware
      or numa emulation produces a wrong cpu node map. It enforces that all
      logical cpus in the same physical cpu share the same node.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Link: http://lkml.kernel.org/r/20110415203928.1303.A69D9226@jp.fujitsu.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
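
The reassignment idea in the fakenuma commit above (all logical cpus of one physical cpu end up on the same node) can be illustrated with plain arrays. This is a toy sketch, not the kernel implementation: `toy_fix_cpu_nodes`, `package_of` and `node_of` are hypothetical names, and the policy of taking the node of the lowest-numbered cpu in each package is an assumption made for the example.

```c
/*
 * Give every logical cpu the node of the lowest-numbered cpu in its package.
 * package_of[cpu] is the physical package id, node_of[cpu] the node id.
 * Returns the number of cpus whose node assignment changed.
 */
static int toy_fix_cpu_nodes(const int package_of[], int node_of[], int nr_cpus)
{
    int changed = 0;

    for (int cpu = 0; cpu < nr_cpus; cpu++) {
        for (int first = 0; first < cpu; first++) {
            if (package_of[first] != package_of[cpu])
                continue;
            /* `first` is the lowest-numbered cpu sharing this package. */
            if (node_of[cpu] != node_of[first]) {
                node_of[cpu] = node_of[first];
                changed++;
            }
            break;
        }
    }
    return changed;
}
```

For two packages with two logical cpus each and a round robin map of nodes 0, 1, 2, 3, the function rewrites the map to 0, 0, 2, 2, so code that only looks at the first cpu of a group sees a consistent node.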
  10. 15 Apr, 2011 4 commits