1. 10 March 2011, 2 commits
    • x86/mm: Fix pgd_lock deadlock · a79e53d8
      Committed by Andrea Arcangeli
      It's forbidden to take the page_table_lock with irqs disabled:
      if there's contention, the IPIs (for TLB flushes) sent while the
      page_table_lock is held will never run, leading to a deadlock.
      
      Nobody takes the pgd_lock from irq context, so the _irqsave can be
      removed.
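      
      A minimal sketch of what this amounts to at the call sites (illustrative
      only; the actual sites are in arch/x86/mm and the Xen mmu code):
      
      	/* before: pgd_lock taken with irqs disabled, deadlock-prone vs. TLB-flush IPIs */
      	spin_lock_irqsave(&pgd_lock, flags);
      	/* ... walk/modify the pgd list ... */
      	spin_unlock_irqrestore(&pgd_lock, flags);
      
      	/* after: nobody takes pgd_lock from irq context, a plain spin_lock is enough */
      	spin_lock(&pgd_lock);
      	/* ... walk/modify the pgd list ... */
      	spin_unlock(&pgd_lock);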
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@kernel.org>
      LKML-Reference: <201102162345.p1GNjMjm021738@imap1.linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86/mm: Handle mm_fault_error() in kernel space · f8626854
      Committed by Andrey Vagin
      mm_fault_error() should not invoke the oom-killer if the page fault
      occurred in kernel space, e.g. in copy_from_user()/copy_to_user().
      
      This would happen if we find ourselves in OOM on a
      copy_to_user(), or a copy_from_user() which faults.
      
      Without this patch, the kernel hangs in copy_from_user(),
      because the OOM killer sends SIGKILL to the current process, but it
      can't handle the signal while in a syscall; the kernel then returns
      to copy_from_user(), re-executes the faulting instruction and
      triggers the page fault again.
      
      With this patch the kernel returns -EFAULT from copy_from_user().
      
      The code which checks that the page fault occurred in kernel space
      has been copied from do_sigbus().
      
      This situation is handled the same way on powerpc, xtensa,
      tile, ...
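      
      A hedged sketch of the added check (the real code is in mm_fault_error()
      in arch/x86/mm/fault.c; details may differ):
      
      	if (fault & VM_FAULT_OOM) {
      		/* Kernel mode? Handle exceptions or die, do not invoke the OOM killer: */
      		if (!(error_code & PF_USER)) {
      			up_read(&current->mm->mmap_sem);
      			no_context(regs, error_code, address);
      			return;
      		}
      		out_of_memory(regs, error_code, address);
      	}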
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@kernel.org>
      LKML-Reference: <201103092322.p29NMNPH001682@imap1.linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 04 March 2011, 1 commit
    • x86, numa: Fix numa_emulation code with memory-less node0 · 3b28cf32
      Committed by Yinghai Lu
      This crash happens on a system that does not have RAM on node0.
      
      When numa_emulation is compiled in, and:
      
       1. we boot the system without numa=fake...
       2. or we boot the system with numa=fake=128 to make emulation fail
      
      we will get:
      
      [    0.076025] ------------[ cut here ]------------
      [    0.080004] kernel BUG at arch/x86/mm/numa_64.c:788!
      [    0.080004] invalid opcode: 0000 [#1] SMP
      [...]
      
      We need to use early_cpu_to_node() directly, because cpu_to_apicid
      and apicid_to_node will return node0, which is not online.
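      
      In code terms, the lookup becomes roughly (illustrative, not the literal diff):
      
      	/* was: nid derived via cpu -> apicid -> apicid_to_node[], which can
      	 * hand back a memory-less, not-yet-onlined node0 */
      	nid = early_cpu_to_node(cpu);	/* use the early per-cpu mapping directly */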
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      LKML-Reference: <4D6ECF72.5010308@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  3. 03 February 2011, 1 commit
  4. 19 January 2011, 1 commit
  5. 14 January 2011, 5 commits
  6. 07 January 2011, 1 commit
    • x86, numa: Fix CONFIG_DEBUG_PER_CPU_MAPS without NUMA emulation · d906f0eb
      Committed by David Rientjes
      "x86, numa: Fake node-to-cpumask for NUMA emulation" broke the
      build when CONFIG_DEBUG_PER_CPU_MAPS is set and CONFIG_NUMA_EMU
      is not.  This is because it is possible to map a cpu to multiple
      nodes when NUMA emulation is used; the patch required a physical
      node address table, only available when CONFIG_NUMA_EMU is enabled,
      to find those nodes.
      
      This extracts the common debug functionality to its own function
      for CONFIG_DEBUG_PER_CPU_MAPS and uses it regardless of whether
      CONFIG_NUMA_EMU is set or not.
      
      NUMA emulation will now iterate over the set of possible nodes
      for each cpu and call the new debug function, whereas only the
      cpu's own node is used when NUMA emulation is disabled.
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <alpine.DEB.2.00.1012301053590.12995@chino.kir.corp.google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  7. 24 December 2010, 5 commits
    • x86, numa: Fix cpu to node mapping for sparse node ids · a387e95a
      Committed by David Rientjes
      NUMA boot code assumes that physical node ids start at 0, but the DIMMs
      that the apic id represents may not be reachable.  If this is the case,
      node 0 is never online and cpus never end up getting appropriately
      assigned to a node.  This causes the cpumask of all online nodes to be
      empty and machines crash with kernel code assuming online nodes have
      valid cpus.
      
      The fix is to appropriately map all the address ranges for physical nodes
      and ensure the cpu to node mapping function checks all possible nodes (up
      to MAX_NUMNODES) instead of simply checking nodes 0-N, where N is the
      number of physical nodes, for valid address ranges.
      
      This requires no longer "compressing" the address ranges of nodes in the
      physical node map from 0-N, but rather leaving indices in physnodes[] to
      represent the actual node id of the physical node.  Accordingly, the
      topology exported by both amd_get_nodes() and acpi_get_nodes() no longer
      needs to return the number of nodes to iterate through; all such iterations
      will now go up to MAX_NUMNODES.
      
      This change also passes the end address of system RAM (which may be
      different from normal operation if mem= is specified on the command line)
      before the physnodes[] array is populated.  ACPI parsed nodes are
      truncated to fit within the address range that respects the mem=
      boundaries, and some physical nodes may even become unreachable in such
      cases.
      
      When NUMA emulation does succeed, any apicid-to-node mappings that exist
      for unreachable nodes are given default values so that proximity domains
      can still be assigned.  This is important for node_distance() to
      function as desired.
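      
      A sketch of the iteration pattern this implies (illustrative; the real
      loops live in the physnodes[]/NUMA-emulation setup code):
      
      	/* scan every possible node id instead of a dense 0..N-1 range */
      	for (i = 0; i < MAX_NUMNODES; i++) {
      		if (physnodes[i].start == physnodes[i].end)
      			continue;		/* no memory registered under this node id */
      		/* ... use physnodes[i].start / physnodes[i].end ... */
      	}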
      Signed-off-by: David Rientjes <rientjes@google.com>
      LKML-Reference: <alpine.DEB.2.00.1012221702090.3701@chino.kir.corp.google.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • x86, numa: Fake node-to-cpumask for NUMA emulation · c1c3443c
      Committed by David Rientjes
      It's necessary to fake the node-to-cpumask mapping so that an emulated
      node ID returns a cpumask that includes all cpus that have affinity to
      the memory it represents.
      
      This is a little intrusive because it requires knowledge of the physical
      topology of the system.  setup_physnodes() gives us that information, but
      since NUMA emulation ends up altering the physnodes array, it's necessary
      to reset it before cpus are brought online.
      
      Accordingly, the physnodes array is moved out of init.data and into
      cpuinit.data since it will be needed on cpuup callbacks.
      
      This works regardless of whether numa=fake is used on the command line, and
      regardless of whether the setup of the fake nodes succeeds or fails.  The physnodes array
      always contains the physical topology of the machine if CONFIG_NUMA_EMU
      is enabled and can be used to setup the correct node-to-cpumask mappings
      in all cases since setup_physnodes() is called whenever the array needs
      to be repopulated with the correct data.
      
      To fake the actual mappings, numa_add_cpu() and numa_remove_cpu() are
      rewritten for CONFIG_NUMA_EMU so that we first find the physical node to
      which each cpu has local affinity, then iterate through all online nodes
      to find the emulated nodes that have local affinity to that physical
      node, and then finally map the cpu to each of those emulated nodes.
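      
      A hedged sketch of the rewritten numa_add_cpu() flow under CONFIG_NUMA_EMU
      (emu_nid_to_physnid() is a hypothetical helper standing in for the
      physnodes[] lookup):
      
      	void __cpuinit numa_add_cpu(int cpu)
      	{
      		int physnid, nid = early_cpu_to_node(cpu);
      
      		physnid = emu_nid_to_physnid(nid);	/* physical node this cpu is local to */
      
      		/* map the cpu into every emulated node backed by that physical node */
      		for_each_online_node(nid) {
      			if (emu_nid_to_physnid(nid) == physnid)
      				cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
      		}
      	}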
      Signed-off-by: David Rientjes <rientjes@google.com>
      LKML-Reference: <alpine.DEB.2.00.1012221701520.3701@chino.kir.corp.google.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • x86, numa: Fake apicid and pxm mappings for NUMA emulation · f51bf307
      Committed by David Rientjes
      This patch adds the equivalent of acpi_fake_nodes() for AMD Northbridge
      platforms.  The goal is to fake the apicid-to-node mappings for NUMA
      emulation so the physical topology of the machine is correctly maintained
      within the kernel.
      
      This change also fakes proximity domains for both ACPI and k8 code so the
      physical distance between emulated nodes is maintained via
      node_distance().  This exports the correct distances via
      /sys/devices/system/node/.../distance based on the underlying topology.
      
      A new helper function, fake_physnodes(), is introduced to correctly
      invoke the correct NUMA code to fake these two mappings based on the
      system type.  If there is no underlying NUMA configuration, all cpus are
      mapped to node 0 for local distance.
      
      Since acpi_fake_nodes() is no longer called with CONFIG_ACPI_NUMA, its
      prototype can be removed from the header file for such a configuration.
      Signed-off-by: David Rientjes <rientjes@google.com>
      LKML-Reference: <alpine.DEB.2.00.1012221701360.3701@chino.kir.corp.google.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • x86, numa: Avoid compiling NUMA emulation functions without CONFIG_NUMA_EMU · 4e76f4e6
      Committed by David Rientjes
      Both acpi_get_nodes() and amd_get_nodes() are only necessary when
      CONFIG_NUMA_EMU is enabled, so avoid compiling them when the option is
      disabled.
      Signed-off-by: David Rientjes <rientjes@google.com>
      LKML-Reference: <alpine.DEB.2.00.1012221701210.3701@chino.kir.corp.google.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • x86, acpi: Parse all SRAT cpu entries even above the cpu number limitation · d3bd0588
      Committed by Yinghai Lu
      Recent new Intel systems use a different ordering in the MADT: they list all
      thread0s first, then all thread1s.
      But the SRAT table still uses the old ordering and lists the cpus of one socket all together.
      
      If the kernel is compiled with a limited NR_CPUS or booted with nr_cpus=, some
      cpus' apic-id-to-node mappings may never be put into apicid_to_node[].
      
      For example, a 4-socket system with 64 cpus booted with nr_cpus=32 will crash:
      
      [    9.106288] Total of 32 processors activated (136190.88 BogoMIPS).
      [    9.235021] divide error: 0000 [#1] SMP
      [    9.235315] last sysfs file:
      [    9.235481] CPU 1
      [    9.235592] Modules linked in:
      [    9.245398]
      [    9.245478] Pid: 2, comm: kthreadd Not tainted 2.6.37-rc1-tip-yh-01782-ge92ef79-dirty #274      /Sun Fire x4800
      [    9.265415] RIP: 0010:[<ffffffff81075a8f>]  [<ffffffff81075a8f>] select_task_rq_fair+0x4f0/0x623
      ...
      [    9.645938] RIP  [<ffffffff81075a8f>] select_task_rq_fair+0x4f0/0x623
      [    9.665356]  RSP <ffff88103f8d1c40>
      [    9.665568] ---[ end trace 2296156d35fdfc87 ]---
      
      So let's just parse all cpu entries in the SRAT.
      
      Also add apicid checking against MAX_LOCAL_APIC, so that we cannot run out of
      the bounds of apicid_to_node[].
      
      This fixes the following bug as well:
      https://bugzilla.kernel.org/show_bug.cgi?id=22662
      
      -v2: expand to 32bit according to hpa
         need to add MAX_LOCAL_APIC for 32bit
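      
      A sketch of the added bounds check when recording an SRAT entry
      (illustrative; the real code is in arch/x86/mm/srat_64.c):
      
      	if (apic_id >= MAX_LOCAL_APIC) {
      		printk(KERN_INFO "SRAT: apic id %u out of range, skipping\n", apic_id);
      		return;
      	}
      	apicid_to_node[apic_id] = node;	/* record every entry, no cpu-number cutoff */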
      Reported-and-Tested-by: Wu Fengguang <fengguang.wu@intel.com>
      Reported-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
      Tested-by: Myron Stowe <myron.stowe@hp.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4D0AD486.9020704@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
  8. 16 December 2010, 1 commit
    • x86, olpc: Add OLPC device-tree support · c10d1e26
      Committed by Andres Salomon
      Make use of PROC_DEVICETREE to export the tree, and sparc's PROMTREE code to
      call into OLPC's Open Firmware to build the tree.
      
      v5: fix buglet with root node check (introduced in v4)
      
      v4: address some minor style issues pointed out by Grant, and explicitly cast
          negative phandle checks to s32.
      
      v3: rename olpc_prom to olpc_dt
        - rework Kconfig entries
        - drop devtree build hook from proc, instead adding a call to x86's
          paging_init (similarly to how sparc64 does it)
        - switch allocation from using slab to alloc_bootmem.  This allows
          the DT to be built earlier during boot (during setup_arch); the
          downside is that there are some 1200 bootmem reservations that are
          done during boot.  Not ideal.
        - add a helper olpc_ofw_is_installed function to test for the
          existence and successful detection of OLPC's OFW.
      Signed-off-by: Andres Salomon <dilinger@queued.net>
      LKML-Reference: <20101116220952.26526a80@queued.net>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
  9. 10 December 2010, 1 commit
  10. 22 November 2010, 1 commit
    • x86: Resume trampoline must be executable · 691513f7
      Committed by Lin Ming
      commit 5bd5a452 (x86: Add NX protection for kernel data) marked the
      trampoline area NX, which unsurprisingly breaks resume and cpu
      hotplug.
      
      Revert the portion of that commit that touches the trampoline.
      
      Originally-from: Lin Ming <ming.m.lin@intel.com>
      LKML-Reference: <1290410581.2405.24.camel@minggr.sh.intel.com>
      Cc: Matthieu Castet <castet.matthieu@free.fr>
      Cc: Siarhei Liakh <sliakh.lkml@gmail.com>
      Cc: Xuxian Jiang <jiang@cs.ncsu.edu>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Tested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  11. 18 November 2010, 5 commits
    • x86, amd-nb: Complete the rename of AMD NB and related code · eec1d4fa
      Committed by Hans Rosenfeld
      Not only was the naming of the files confusing, it was even more so for
      the function and variable names.
      
      Rename the K8 NB and NUMA code that is also used on other AMD
      platforms.  This also renames the CONFIG_K8_NUMA option to
      CONFIG_AMD_NUMA and the related file k8topology_64.c to
      amdtopology_64.c.  No functional changes intended.
      Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com>
      Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
    • x86: Eliminate bp argument from the stack tracing routines · 9c0729dc
      Committed by Soeren Sandmann Pedersen
      The various stack tracing routines take a 'bp' argument in which the
      caller is supposed to provide the base pointer to use, or 0 if it
      doesn't have one.  Since bp is garbage whenever CONFIG_FRAME_POINTER is not
      defined, this means all callers in principle should either always pass
      0, or be conditional on CONFIG_FRAME_POINTER.
      
      However, there are only really three use cases for stack tracing:
      
      (a) Trace the current task, including IRQ stack if any
      (b) Trace the current task, but skip IRQ stack
      (c) Trace some other task
      
      In all cases, if CONFIG_FRAME_POINTER is not defined, bp should just
      be 0.  If it _is_ defined, then
      
      - in case (a) bp should be gotten directly from the CPU's register, so
        the caller should pass NULL for regs,
      
      - in case (b) the caller should pass the IRQ registers to
        dump_trace(),
      
      - in case (c) bp should be gotten from the top of the task's stack, so
        the caller should pass NULL for regs.
      
      Hence, the bp argument is not necessary because the combination of
      task and regs is sufficient to determine an appropriate value for bp.
      
      This patch introduces a new inline function stack_frame(task, regs)
      that computes the desired bp. This function is then called from the
      two versions of dump_stack().
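      
      A hedged sketch of the new helper (the real version lives in
      arch/x86/include/asm/stacktrace.h):
      
      	static inline unsigned long stack_frame(struct task_struct *task,
      						struct pt_regs *regs)
      	{
      	#ifdef CONFIG_FRAME_POINTER
      		unsigned long bp;
      
      		if (regs)
      			return regs->bp;	/* case (b): IRQ registers were passed in */
      
      		if (task == current) {
      			get_bp(bp);		/* case (a): read bp straight from the CPU */
      			return bp;
      		}
      
      		/* case (c): bp was saved on the other task's stack by switch_to() */
      		return *(unsigned long *)task->thread.sp;
      	#else
      		return 0;			/* without frame pointers, bp is meaningless */
      	#endif
      	}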
      Signed-off-by: Soren Sandmann <ssp@redhat.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      LKML-Reference: <m3oc9rop28.fsf@dhcp-100-3-82.bos.redhat.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    • x86: Add NX protection for kernel data · 5bd5a452
      Committed by Matthieu Castet
      This patch expands the functionality of CONFIG_DEBUG_RODATA to set the main
      (static) kernel data area as NX.
      
      The following steps are taken to achieve this:
      
       1. The linker script is adjusted so .text always starts and ends on a page boundary
       2. The linker script is adjusted so .rodata always starts and ends on a page boundary
       3. NX is set for all pages from _etext through _end in mark_rodata_ro (a sketch follows the list)
       4. free_init_pages() sets released memory NX in arch/x86/mm/init.c
       5. The BIOS ROM is set to x (executable) when pcibios is used.
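      
      A sketch of step 3 (illustrative only; the real code is a small helper
      called from the mark_rodata_ro() path):
      
      	unsigned long start = PFN_ALIGN(_etext);
      	unsigned long size  = PFN_ALIGN(_end) - start;
      
      	set_memory_nx(start, size >> PAGE_SHIFT);	/* pages after .text/.rodata become NX */
      	printk(KERN_INFO "NX-protecting the kernel data: %luk\n", size >> 10);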
      
      The results of patch application may be observed in the diff of kernel page
      table dumps:
      
      pcibios:
      
       -- data_nx_pt_before.txt       2009-10-13 07:48:59.000000000 -0400
       ++ data_nx_pt_after.txt        2009-10-13 07:26:46.000000000 -0400
        0x00000000-0xc0000000           3G                           pmd
        ---[ Kernel Mapping ]---
       -0xc0000000-0xc0100000           1M     RW             GLB x  pte
       +0xc0000000-0xc00a0000         640K     RW             GLB NX pte
       +0xc00a0000-0xc0100000         384K     RW             GLB x  pte
       -0xc0100000-0xc03d7000        2908K     ro             GLB x  pte
       +0xc0100000-0xc0318000        2144K     ro             GLB x  pte
       +0xc0318000-0xc03d7000         764K     ro             GLB NX pte
       -0xc03d7000-0xc0600000        2212K     RW             GLB x  pte
       +0xc03d7000-0xc0600000        2212K     RW             GLB NX pte
        0xc0600000-0xf7a00000         884M     RW         PSE GLB NX pmd
        0xf7a00000-0xf7bfe000        2040K     RW             GLB NX pte
        0xf7bfe000-0xf7c00000           8K                           pte
      
      No pcibios:
      
       -- data_nx_pt_before.txt       2009-10-13 07:48:59.000000000 -0400
       ++ data_nx_pt_after.txt        2009-10-13 07:26:46.000000000 -0400
        0x00000000-0xc0000000           3G                           pmd
        ---[ Kernel Mapping ]---
       -0xc0000000-0xc0100000           1M     RW             GLB x  pte
       +0xc0000000-0xc0100000           1M     RW             GLB NX pte
       -0xc0100000-0xc03d7000        2908K     ro             GLB x  pte
       +0xc0100000-0xc0318000        2144K     ro             GLB x  pte
       +0xc0318000-0xc03d7000         764K     ro             GLB NX pte
       -0xc03d7000-0xc0600000        2212K     RW             GLB x  pte
       +0xc03d7000-0xc0600000        2212K     RW             GLB NX pte
        0xc0600000-0xf7a00000         884M     RW         PSE GLB NX pmd
        0xf7a00000-0xf7bfe000        2040K     RW             GLB NX pte
        0xf7bfe000-0xf7c00000           8K                           pte
      
      The patch was originally developed for Linux 2.6.34-rc2 x86 by
      Siarhei Liakh <sliakh.lkml@gmail.com> and Xuxian Jiang <jiang@cs.ncsu.edu>.
      
       -v1:  initial patch for 2.6.30
       -v2:  patch for 2.6.31-rc7
       -v3:  moved all code into arch/x86, adjusted credits
       -v4:  fixed ifdef, removed credits from CREDITS
       -v5:  fixed an address calculation bug in mark_nxdata_nx()
       -v6:  added acked-by and PT dump diff to commit log
       -v7:  minor adjustments for -tip
       -v8:  rework with the merge of "Set first MB as RW+NX"
      Signed-off-by: Siarhei Liakh <sliakh.lkml@gmail.com>
      Signed-off-by: Xuxian Jiang <jiang@cs.ncsu.edu>
      Signed-off-by: Matthieu CASTET <castet.matthieu@free.fr>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: James Morris <jmorris@namei.org>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Kees Cook <kees.cook@canonical.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <4CE2F82E.60601@free.fr>
      [ minor cleanliness edits ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: Fix improper large page preservation · 64edc8ed
      Committed by matthieu castet
      This patch fixes a bug in try_preserve_large_page() which may
      result in improper large page preservation and improper
      application of page attributes to the memory area outside of the
      original change request.
      
      More specifically, the problem manifests itself when set_memory_*()
      is called for several pages at the beginning of the large page and
      try_preserve_large_page() erroneously concludes that the change can
      be applied to the whole large page.
      
      The fix consists of 3 parts:
      
        1. Addition of "required" protection attributes in
           static_protections(), so .data and .bss can be guaranteed to
           stay "RW"
      
        2. static_protections() is now called for every small
           page within the large page to determine the compatibility of the new
           protection attributes (instead of just the small pages within the
           requested range); see the sketch after this list.
      
        3. A large page can be preserved only if the attribute change is
           large-page-aligned and covers the whole large page.
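      
      A hedged sketch of part 2 (the real loop is in try_preserve_large_page()
      in arch/x86/mm/pageattr.c; names are approximate):
      
      	/* check every 4K page covered by the large page, not just the
      	 * requested range, before deciding the large page can be kept */
      	addr = address & pmask;			/* start of the large page */
      	pfn = pte_pfn(old_pte);
      	for (i = 0; i < (psize >> PAGE_SHIFT); i++, addr += PAGE_SIZE, pfn++) {
      		pgprot_t chk_prot = static_protections(req_prot, addr, pfn);
      
      		if (pgprot_val(chk_prot) != pgprot_val(new_prot))
      			goto out_unlock;	/* incompatible page found: do not preserve */
      	}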
      
       -v1: Try_preserve_large_page() patch for Linux 2.6.34-rc2
       -v2: Replaced pfn check with address check for kernel rw-data
      Signed-off-by: Siarhei Liakh <sliakh.lkml@gmail.com>
      Signed-off-by: Xuxian Jiang <jiang@cs.ncsu.edu>
      Reviewed-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: James Morris <jmorris@namei.org>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Kees Cook <kees.cook@canonical.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <4CE2F7F3.8030809@free.fr>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: Use online node real index in calulate_tbl_offset() · 9223081f
      Committed by Yinghai Lu
      Found a NUMA system that doesn't have RAM installed in the first
      socket and which hangs while executing init scripts.
      
      Bisected it to:
      
       | commit 93296720
       | Author: Shaohua Li <shaohua.li@intel.com>
       | Date:   Wed Oct 20 11:07:03 2010 +0800
       |
       |     x86: Spread tlb flush vector between nodes
      
      It turns out that when the first socket is not online, cpus on
      node1 could get a tlb_offset bigger than NUM_INVALIDATE_TLB_VECTORS.
      
      That could affect systems with, say, 4 sockets where socket 2 is not
      populated: socket 3 would then get a too-large tlb_offset.
      
      We need to use the real online node index.
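      
      The shape of the fix, roughly (illustrative):
      
      	/* number the online nodes densely and use that index, not the raw
      	 * (possibly sparse) node id, when picking a flush vector group */
      	idx = 0;
      	for_each_online_node(node) {
      		node_offset = (idx % NUM_INVALIDATE_TLB_VECTORS) * nr_node_vecs;
      		/* ... assign tlb_vector_offset for the cpus of this node ... */
      		idx++;
      	}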
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <4CDEDE59.40603@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  12. 11 November 2010, 1 commit
  13. 01 November 2010, 1 commit
    • x86, mm: Fix section mismatch in tlb.c · cf38d0ba
      Committed by Rakib Mullick
      Mark tlb_cpuhp_notify as __cpuinit.  It is basically a callback
      function, which is called from the __cpuinit init_smp_flush(), so
      it is safe.
      
      We were warned by the following warning:
      
       WARNING: arch/x86/mm/built-in.o(.text+0x356d): Section mismatch
       in reference from the function tlb_cpuhp_notify() to the
       function .cpuinit.text:calculate_tlb_offset()
       The function tlb_cpuhp_notify() references
       the function __cpuinit calculate_tlb_offset().
       This is often because tlb_cpuhp_notify lacks a __cpuinit
       annotation or the annotation of calculate_tlb_offset is wrong.
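      
      The fix boils down to the annotation on the callback; roughly (sketch):
      
      	static int __cpuinit tlb_cpuhp_notify(struct notifier_block *n,
      					      unsigned long action, void *hcpu)
      	{
      		switch (action & 0xf) {
      		case CPU_ONLINE:
      		case CPU_DEAD:
      			calculate_tlb_offset();	/* __cpuinit callee, so the caller must match */
      		}
      		return NOTIFY_OK;
      	}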
      Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      LKML-Reference: <AANLkTinWQRG=HA9uB3ad0KAqRRTinL6L_4iKgF84coph@mail.gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  14. 29 October 2010, 1 commit
  15. 28 October 2010, 2 commits
  16. 27 October 2010, 3 commits
    • x86: access_error API cleanup · 68da336a
      Committed by Michel Lespinasse
      access_error() already takes error_code as an argument, so there is
      no need for an additional write flag.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: retry page fault when blocking on disk transfer · d065bd81
      Committed by Michel Lespinasse
      This change reduces mmap_sem hold times that are caused by waiting for
      disk transfers when accessing file mapped VMAs.
      
      It introduces the VM_FAULT_ALLOW_RETRY flag, which indicates that the call
      site wants mmap_sem to be released if blocking on a pending disk transfer.
      In that case, filemap_fault() returns the VM_FAULT_RETRY status bit and
      do_page_fault() will then re-acquire mmap_sem and retry the page fault.
      
      It is expected that the retry will hit the same page which will now be
      cached, and thus it will complete with a low mmap_sem hold time.
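      
      A hedged sketch of the new flow in an arch fault handler (x86's
      do_page_fault(); simplified):
      
      	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_WRITE;
      
      retry:
      	down_read(&mm->mmap_sem);
      	/* ... find_vma(), access checks ... */
      	fault = handle_mm_fault(mm, vma, address, flags);
      
      	if (fault & VM_FAULT_RETRY) {
      		/* filemap_fault() dropped mmap_sem while waiting for the page;
      		 * clear ALLOW_RETRY so we only ever retry once */
      		flags &= ~FAULT_FLAG_ALLOW_RETRY;
      		goto retry;
      	}
      	up_read(&mm->mmap_sem);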
      
      Tests:
      
      - microbenchmark: thread A mmaps a large file and does random read accesses
        to the mmaped area - achieves about 55 iterations/s. Thread B does
        mmap/munmap in a loop at a separate location - achieves 55 iterations/s
        before, 15000 iterations/s after.
      
      - We are seeing related effects in some applications in house, which show
        significant performance regressions when running without this change.
      
      [akpm@linux-foundation.org: fix warning & crash]
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: stack based kmap_atomic() · 3e4d3af5
      Committed by Peter Zijlstra
      Keep the current interface but ignore the KM_type and use a stack-based
      approach.
      
      The advantage is that we get rid of crappy code like:
      
      	#define __KM_PTE			\
      		(in_nmi() ? KM_NMI_PTE : 	\
      		 in_irq() ? KM_IRQ_PTE :	\
      		 KM_PTE0)
      
      and in general can stop worrying about what context we're in and what kmap
      slots might be appropriate for that.
      
      The downside is that FRV kmap_atomic() gets more expensive.
      
      For now we use a CPP trick suggested by Andrew:
      
        #define kmap_atomic(page, args...) __kmap_atomic(page)
      
      to avoid having to touch all kmap_atomic() users in a single patch.
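      
      For reference, a hedged sketch of the stack-based x86 side (the per-cpu
      slot index now comes from a push/pop pair rather than a KM_type constant;
      simplified, details may differ from the real highmem_32.c):
      
      	void *__kmap_atomic(struct page *page)
      	{
      		unsigned long vaddr;
      		int idx, type;
      
      		pagefault_disable();
      		if (!PageHighMem(page))
      			return page_address(page);
      
      		type = kmap_atomic_idx_push();		/* next free slot on the per-cpu stack */
      		idx = type + KM_TYPE_NR * smp_processor_id();
      		vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
      		set_pte(kmap_pte - idx, mk_pte(page, kmap_prot));
      
      		return (void *)vaddr;
      	}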
      
      [ not compiled on:
        - mn10300: the arch doesn't actually build with highmem to begin with ]
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix up drivers/gpu/drm/i915/intel_overlay.c]
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Airlie <airlied@linux.ie>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 21 October 2010, 3 commits
    • x86: Spread tlb flush vector between nodes · 93296720
      Committed by Shaohua Li
      Currently, flush tlb vector allocation is based on the below equation:
      	sender = smp_processor_id() % 8
      This isn't optimal: CPUs from different nodes can end up with the same vector,
      which causes a lot of lock contention.  Instead, we can assign the same vectors to
      CPUs from the same node, while different nodes get different vectors.  This has
      the following advantages:
      a. if there is lock contention, the contention is between CPUs from one
      node.  This should be much cheaper than contention between nodes.
      b. it completely avoids lock contention between nodes.  This especially benefits
      kswapd, which is the biggest user of tlb flushes, since kswapd sets its affinity
      to a specific node.
      
      In my test, this could reduce > 20% CPU overhead in the extreme case.  The test
      machine has 4 nodes and each node has 16 CPUs.  I then bind each node's kswapd
      to the first CPU of the node.  I run a workload with 4 sequential mmap file
      read threads.  The files are empty sparse files.  This workload triggers a
      lot of page reclaim and tlb flushing.  The kswapd binding is there to easily trigger
      the extreme tlb flush lock contention, because otherwise kswapd keeps migrating
      between CPUs of a node and I can't get stable results.  Sure, in a real workload
      we won't always see such big tlb flush lock contention, but it's possible.
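      
      A hedged sketch of the per-node assignment (the real code adds a
      calculate_tlb_offset() helper in arch/x86/mm/tlb.c; simplified):
      
      	/* split the NUM_INVALIDATE_TLB_VECTORS vectors among the online nodes */
      	if (nr_online_nodes > NUM_INVALIDATE_TLB_VECTORS)
      		nr_node_vecs = 1;
      	else
      		nr_node_vecs = NUM_INVALIDATE_TLB_VECTORS / nr_online_nodes;
      
      	for_each_online_node(node) {
      		int node_offset = (node % NUM_INVALIDATE_TLB_VECTORS) * nr_node_vecs;
      		int cpu_offset = 0;
      
      		for_each_cpu(cpu, cpumask_of_node(node)) {
      			per_cpu(tlb_vector_offset, cpu) = node_offset + cpu_offset;
      			cpu_offset = (cpu_offset + 1) % nr_node_vecs;
      		}
      	}
      
      Note the raw node id used for node_offset here is what the "Use online node
      real index" fix (9223081f, above) later replaces with a dense online-node index.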
      
      [ hpa: folded in fix from Eric Dumazet to use this_cpu_read() ]
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      LKML-Reference: <1287544023.4571.8.camel@sli10-conroe.sh.intel.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • x86-32, mm: Add an initial page table for core bootstrapping · b40827fa
      Committed by Borislav Petkov
      This patch adds an initial page table with low mappings used exclusively
      for booting APs/resuming after ACPI suspend/machine restart. After this,
      there's no need to add low mappings to swapper_pg_dir and zap them later,
      or to create our own swsusp PGD page solely for ACPI sleep needs; we have
      initial_page_table for that.
      Signed-off-by: Borislav Petkov <bp@alien8.de>
      LKML-Reference: <20101020070526.GA9588@liondog.tnic>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • x86, mm: Fix incorrect data type in vmalloc_sync_all() · f01f7c56
      Committed by Borislav Petkov
      arch/x86/mm/fault.c: In function 'vmalloc_sync_all':
      arch/x86/mm/fault.c:238: warning: assignment makes integer from pointer without a cast
      
      introduced by 617d34d9.
      Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
      LKML-Reference: <20101020103642.GA3135@kryptos.osrc.amd.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
  18. 20 October 2010, 2 commits
  19. 15 October 2010, 1 commit
    • x86: Barf when vmalloc and kmemcheck faults happen in NMI · ebc8827f
      Committed by Frederic Weisbecker
      In x86, faults exit by executing the iret instruction, which then
      reenables NMIs if we faulted in NMI context. Then if a fault
      happens in NMI, another NMI can nest after the fault exits.
      
      But we don't yet support nested NMIs because we have only one NMI
      stack.  To protect against that, check that vmalloc and kmemcheck
      faults don't happen in this context.  Most of the other kernel faults
      in NMIs can be spotted more easily by finding explicit
      copy_from/to_user() calls on review.
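      
      The check itself is tiny; a sketch of what gets added to vmalloc_fault()
      (and the kmemcheck fault path):
      
      	WARN_ON_ONCE(in_nmi());	/* nested NMIs are unsupported: barf loudly if this happens */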
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
  20. 14 October 2010, 1 commit
    • xen: Cope with unmapped pages when initializing kernel pagetable · fef5ba79
      Committed by Jeremy Fitzhardinge
      Xen requires that all pages containing pagetable entries be mapped
      read-only.  If pages used for the initial pagetable are already mapped
      then we can change the mapping to RO.  However, if they are initially
      unmapped, we need to make sure that when they are later mapped, they
      are also mapped RO.
      
      We do this by knowing that the kernel pagetable memory is pre-allocated
      in the range e820_table_start - e820_table_end, so any pfn within this
      range should be mapped read-only.  However, the pagetable setup code
      early_ioremaps the pages to write their entries, so we must make sure
      that mappings created in the early_ioremap fixmap area are mapped RW.
      (Those mappings are removed before the pages are presented to Xen
      as pagetable pages.)
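      
      A hedged sketch of the predicate involved (variable names follow the commit
      text; the real logic sits in the Xen pagetable setup in arch/x86/xen/mmu.c):
      
      	/* pfns inside the pre-allocated pagetable area must only ever be
      	 * mapped read-only once they do get mapped */
      	if (pfn >= e820_table_start && pfn < e820_table_end)
      		pte = pte_wrprotect(pte);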
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      LKML-Reference: <4CB63A80.8060702@goop.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
  21. 12 October 2010, 1 commit
    • x86, numa: For each node, register the memory blocks actually used · 73cf624d
      Committed by Yinghai Lu
      Russ recently reported that SGI UV is broken.  He said:
      
      | The SRAT table shows that memory range is spread over two nodes.
      |
      | SRAT: Node 0 PXM 0 100000000-800000000
      | SRAT: Node 1 PXM 1 800000000-1000000000
      | SRAT: Node 0 PXM 0 1000000000-1080000000
      |
      |Previously, the kernel early_node_map[] would show three entries
      |with the proper node.
      |
      |[    0.000000]     0: 0x00100000 -> 0x00800000
      |[    0.000000]     1: 0x00800000 -> 0x01000000
      |[    0.000000]     0: 0x01000000 -> 0x01080000
      |
      |The problem is recent community kernel early_node_map[] shows
      |only two entries with the node 0 entry overlapping the node 1
      |entry.
      |
      |    0: 0x00100000 -> 0x01080000
      |    1: 0x00800000 -> 0x01000000
      
      After looking at the changelog, I found out that it has been broken for a while
      by the following commit:
      
      |commit 8716273c
      |Author: David Rientjes <rientjes@google.com>
      |Date:   Fri Sep 25 15:20:04 2009 -0700
      |
      |    x86: Export srat physical topology
      
      Before that commit, register_active_regions() was called for every SRAT memory
      entry right away.
      
      Use nodememblk_range[] instead of nodes[] in order to make sure we
      capture the actual memory blocks registered with each node.  nodes[]
      contains an extended range which spans all memory regions associated
      with a node, but that does not mean that all the memory in between
      is included.
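      
      A hedged sketch of the idea (the arrays already exist in
      arch/x86/mm/srat_64.c; exact call details may differ):
      
      	/* register what the SRAT actually recorded, block by block,
      	 * instead of each node's min..max span */
      	for (i = 0; i < num_node_memblks; i++)
      		e820_register_active_regions(memblk_nodeid[i],
      				node_memblk_range[i].start >> PAGE_SHIFT,
      				node_memblk_range[i].end >> PAGE_SHIFT);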
      Reported-by: Russ Anderson <rja@sgi.com>
      Tested-by: Russ Anderson <rja@sgi.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4CB27BDF.5000800@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: <stable@kernel.org> 2.6.33 .34 .35 .36
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>