1. 01 Nov 2010, 1 commit
    • x86, mm: Fix section mismatch in tlb.c · cf38d0ba
      Committed by Rakib Mullick
      Mark tlb_cpuhp_notify as __cpuinit. It is a callback function that is
      only called from the __cpuinit function init_smp_flush(), so the
      annotation is safe.
      
      The build produced the following warning:
      
       WARNING: arch/x86/mm/built-in.o(.text+0x356d): Section mismatch
       in reference from the function tlb_cpuhp_notify() to the
       function .cpuinit.text:calculate_tlb_offset()
       The function tlb_cpuhp_notify() references
       the function __cpuinit calculate_tlb_offset().
       This is often because tlb_cpuhp_notify lacks a __cpuinit
       annotation or the annotation of calculate_tlb_offset is wrong.
      Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      LKML-Reference: <AANLkTinWQRG=HA9uB3ad0KAqRRTinL6L_4iKgF84coph@mail.gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
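      As a minimal sketch (assuming the usual CPU-hotplug notifier shape;
      not the verbatim patch), the fix is the annotation itself, which moves
      the callback into .cpuinit.text alongside its callee:

      	static int __cpuinit tlb_cpuhp_notify(struct notifier_block *n,
      					      unsigned long action, void *hcpu)
      	{
      		switch (action & 0xf) {
      		case CPU_ONLINE:
      		case CPU_DEAD:
      			calculate_tlb_offset();	/* a __cpuinit function */
      		}
      		return NOTIFY_OK;
      	}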
  2. 29 Oct 2010, 1 commit
  3. 28 Oct 2010, 2 commits
  4. 27 Oct 2010, 3 commits
    • x86: access_error API cleanup · 68da336a
      Committed by Michel Lespinasse
      access_error() already takes error_code as an argument, so there is
      no need for an additional write flag.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
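      A hedged sketch of the resulting shape (the PF_WRITE bit of the x86
      page fault error code already encodes a write access; details may
      differ from the actual patch):

      	static inline int access_error(unsigned long error_code,
      				       struct vm_area_struct *vma)
      	{
      		if (error_code & PF_WRITE) {
      			/* write access: the VMA must be writable */
      			if (unlikely(!(vma->vm_flags & VM_WRITE)))
      				return 1;
      			return 0;
      		}
      		/* read access: the VMA must allow read or exec */
      		if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
      			return 1;
      		return 0;
      	}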
    • mm: retry page fault when blocking on disk transfer · d065bd81
      Committed by Michel Lespinasse
      This change reduces mmap_sem hold times that are caused by waiting for
      disk transfers when accessing file mapped VMAs.
      
      It introduces the VM_FAULT_ALLOW_RETRY flag, which indicates that the call
      site wants mmap_sem to be released if blocking on a pending disk transfer.
      In that case, filemap_fault() returns the VM_FAULT_RETRY status bit and
      do_page_fault() will then re-acquire mmap_sem and retry the page fault.
      
      It is expected that the retry will hit the same page which will now be
      cached, and thus it will complete with a low mmap_sem hold time.
      
      Tests:
      
      - microbenchmark: thread A mmaps a large file and does random read
        accesses to the mmapped area, achieving about 55 iterations/s.
        Thread B does mmap/munmap in a loop at a separate location; it
        achieves 55 iterations/s before this change and 15000 iterations/s
        after.
      
      - We are seeing related effects in some in-house applications, which
        show significant performance regressions when running without this
        change.
      
      [akpm@linux-foundation.org: fix warning & crash]
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
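      A hedged sketch of the retry flow (sketch_page_fault() is a
      hypothetical, trimmed-down shape of the x86 do_page_fault() logic;
      the real code has many more checks):

      	static void sketch_page_fault(struct mm_struct *mm,
      				      unsigned long address, int write)
      	{
      		struct vm_area_struct *vma;
      		unsigned int flags = FAULT_FLAG_ALLOW_RETRY |
      				     (write ? FAULT_FLAG_WRITE : 0);
      		int fault;

      	retry:
      		down_read(&mm->mmap_sem);
      		vma = find_vma(mm, address);
      		/* ... vma validity and access checks elided ... */
      		fault = handle_mm_fault(mm, vma, address, flags);
      		if (unlikely(fault & VM_FAULT_RETRY)) {
      			/* mmap_sem was already dropped while waiting on
      			 * the page lock; retry once, now allowed to block. */
      			flags &= ~FAULT_FLAG_ALLOW_RETRY;
      			goto retry;
      		}
      		up_read(&mm->mmap_sem);
      	}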
    • mm: stack based kmap_atomic() · 3e4d3af5
      Committed by Peter Zijlstra
      Keep the current interface but ignore the KM_type argument and use a
      stack-based approach.
      
      The advantage is that we get rid of crappy code like:
      
      	#define __KM_PTE			\
      		(in_nmi() ? KM_NMI_PTE : 	\
      		 in_irq() ? KM_IRQ_PTE :	\
      		 KM_PTE0)
      
      and in general can stop worrying about what context we're in and what kmap
      slots might be appropriate for that.
      
      The downside is that FRV kmap_atomic() gets more expensive.
      
      For now we use a CPP trick suggested by Andrew:
      
        #define kmap_atomic(page, args...) __kmap_atomic(page)
      
      to avoid having to touch all kmap_atomic() users in a single patch.
      
      [ not compiled on:
        - mn10300: the arch doesn't actually build with highmem to begin with ]
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix up drivers/gpu/drm/i915/intel_overlay.c]
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Airlie <airlied@linux.ie>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
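      A hedged sketch of the per-CPU slot stack this introduces (names
      follow the patch, but the details here are illustrative):

      	static DEFINE_PER_CPU(int, __kmap_atomic_idx);

      	static inline int kmap_atomic_idx_push(void)
      	{
      		int idx = __get_cpu_var(__kmap_atomic_idx)++;

      		BUG_ON(idx >= KM_TYPE_NR);	/* slot stack overflow */
      		return idx;
      	}

      	static inline void kmap_atomic_idx_pop(void)
      	{
      		__get_cpu_var(__kmap_atomic_idx)--;
      	}

      	/* __kmap_atomic(page) then derives its fixmap slot from the
      	 * pushed index, idx + KM_TYPE_NR * smp_processor_id(), instead
      	 * of from a caller-supplied KM_type. */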
  5. 21 Oct 2010, 3 commits
    • x86: Spread tlb flush vector between nodes · 93296720
      Committed by Shaohua Li
      Currently, flush TLB vector allocation is based on the equation:
      	sender = smp_processor_id() % 8
      This isn't optimal: CPUs from different nodes can end up with the same
      vector, which causes a lot of lock contention. Instead, we can assign
      the same vectors to CPUs from the same node, while different nodes get
      different vectors. This has the following advantages:
      a. If there is lock contention, it is between CPUs of one node, which
      should be much cheaper than contention between nodes.
      b. Lock contention between nodes is avoided entirely. This especially
      benefits kswapd, the biggest user of TLB flushes, since kswapd sets
      its affinity to a specific node.
      
      In my test, this reduced CPU overhead by more than 20% in the extreme
      case. The test machine has 4 nodes with 16 CPUs each. I bound each
      node's kswapd to the first CPU of its node, then ran a workload with 4
      threads doing sequential reads of mmapped files. The files are empty
      sparse files. This workload triggers a lot of page reclaim and TLB
      flushing. Binding kswapd makes it easy to trigger the extreme TLB
      flush lock contention; otherwise kswapd keeps migrating between the
      CPUs of a node and I can't get a stable result. Real workloads won't
      always show TLB flush lock contention this severe, but it is possible.
      
      [ hpa: folded in fix from Eric Dumazet to use this_cpu_read() ]
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      LKML-Reference: <1287544023.4571.8.camel@sli10-conroe.sh.intel.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
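      A hedged sketch of the allocation idea (illustrative only:
      setup_tlb_vector() is a hypothetical per-CPU helper, whereas the
      patch's real function is calculate_tlb_offset(); assume the 8
      invalidate vectors mentioned above):

      	static DEFINE_PER_CPU_READ_MOSTLY(int, tlb_vector_offset);

      	/* Partition the 8 invalidate vectors into per-node slices,
      	 * then spread each node's CPUs across its own slice. */
      	static void __cpuinit setup_tlb_vector(int cpu)
      	{
      		int vecs_per_node = max(1, 8 / num_online_nodes());
      		int node = cpu_to_node(cpu);
      		int base = (node * vecs_per_node) % 8;

      		per_cpu(tlb_vector_offset, cpu) =
      			base + cpu % vecs_per_node;
      	}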
    • x86-32, mm: Add an initial page table for core bootstrapping · b40827fa
      Committed by Borislav Petkov
      This patch adds an initial page table with low mappings used
      exclusively for booting APs, resuming after ACPI suspend, and machine
      restart. After this, there is no need to add low mappings to
      swapper_pg_dir and zap them later, or to create a separate swsusp PGD
      page solely for ACPI sleep needs - we have initial_page_table for
      that.
      Signed-off-by: Borislav Petkov <bp@alien8.de>
      LKML-Reference: <20101020070526.GA9588@liondog.tnic>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • x86, mm: Fix incorrect data type in vmalloc_sync_all() · f01f7c56
      Committed by Borislav Petkov
      arch/x86/mm/fault.c: In function 'vmalloc_sync_all':
      arch/x86/mm/fault.c:238: warning: assignment makes integer from pointer without a cast
      
      This was introduced by commit 617d34d9.
      Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
      LKML-Reference: <20101020103642.GA3135@kryptos.osrc.amd.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
  6. 20 Oct 2010, 2 commits
  7. 15 Oct 2010, 1 commit
    • x86: Barf when vmalloc and kmemcheck faults happen in NMI · ebc8827f
      Committed by Frederic Weisbecker
      In x86, faults exit by executing the iret instruction, which then
      reenables NMIs if we faulted in NMI context. Then if a fault
      happens in NMI, another NMI can nest after the fault exits.
      
      But we don't yet support nested NMIs because we have only one NMI
      stack. To guard against that, check that vmalloc and kmemcheck faults
      don't happen in this context. Most other kernel faults in NMIs can be
      spotted more easily by finding explicit copy_{from,to}_user() calls
      on review.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
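      A hedged sketch of the guard (the patch places a check like this at
      the top of the vmalloc and kmemcheck fault paths; body elided):

      	static noinline __kprobes int vmalloc_fault(unsigned long address)
      	{
      		/* A vmalloc fault taken in NMI context would iret and
      		 * re-enable NMIs, allowing an unsupported nested NMI. */
      		WARN_ON_ONCE(in_nmi());
      		/* ... */
      	}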
  8. 14 Oct 2010, 1 commit
    • xen: Cope with unmapped pages when initializing kernel pagetable · fef5ba79
      Committed by Jeremy Fitzhardinge
      Xen requires that all pages containing pagetable entries be mapped
      read-only.  If pages used for the initial pagetable are already mapped
      then we can change the mapping to RO.  However, if they are initially
      unmapped, we need to make sure that when they are later mapped, they
      are also mapped RO.
      
      We do this by knowing that the kernel pagetable memory is pre-allocated
      in the range e820_table_start - e820_table_end, so any pfn within this
      range should be mapped read-only.  However, the pagetable setup code
      early_ioremaps the pages to write their entries, so we must make sure
      that mappings created in the early_ioremap fixmap area are mapped RW.
      (Those mappings are removed before the pages are presented to Xen
      as pagetable pages.)
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      LKML-Reference: <4CB63A80.8060702@goop.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
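      A hedged sketch of the pfn-range rule described above, close in
      spirit to the patch's mask_rw_pte() (names assume that era's code,
      where is_early_ioremap_ptep() spots the transient RW fixmap
      mappings):

      	static __init pte_t mask_rw_pte(pte_t *ptep, pte_t pte)
      	{
      		unsigned long pfn = pte_pfn(pte);

      		/* Pre-allocated pagetable pages must be mapped RO,
      		 * except while being written through the early_ioremap
      		 * fixmap; those transient mappings stay RW. */
      		if (!is_early_ioremap_ptep(ptep) &&
      		    pfn >= e820_table_start && pfn < e820_table_end)
      			pte = pte_wrprotect(pte);

      		return pte;
      	}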
  9. 12 Oct 2010, 1 commit
    • x86, numa: For each node, register the memory blocks actually used · 73cf624d
      Committed by Yinghai Lu
      Russ reported that SGI UV recently broke. He said:
      
      | The SRAT table shows that memory range is spread over two nodes.
      |
      | SRAT: Node 0 PXM 0 100000000-800000000
      | SRAT: Node 1 PXM 1 800000000-1000000000
      | SRAT: Node 0 PXM 0 1000000000-1080000000
      |
      |Previously, the kernel early_node_map[] would show three entries
      |with the proper node.
      |
      |[    0.000000]     0: 0x00100000 -> 0x00800000
      |[    0.000000]     1: 0x00800000 -> 0x01000000
      |[    0.000000]     0: 0x01000000 -> 0x01080000
      |
      |The problem is recent community kernel early_node_map[] shows
      |only two entries with the node 0 entry overlapping the node 1
      |entry.
      |
      |    0: 0x00100000 -> 0x01080000
      |    1: 0x00800000 -> 0x01000000
      
      After looking at the changelog, I found that it had been broken for a
      while by the following commit:
      
      |commit 8716273c
      |Author: David Rientjes <rientjes@google.com>
      |Date:   Fri Sep 25 15:20:04 2009 -0700
      |
      |    x86: Export srat physical topology
      
      Before that commit, register_active_regions() was called for every
      SRAT memory entry right away.
      
      Use nodememblk_range[] instead of nodes[] in order to make sure we
      capture the actual memory blocks registered with each node.  nodes[]
      contains an extended range which spans all memory regions associated
      with a node, but that does not mean that all the memory in between is
      included.
      Reported-by: Russ Anderson <rja@sgi.com>
      Tested-by: Russ Anderson <rja@sgi.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4CB27BDF.5000800@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: <stable@kernel.org> 2.6.33 .34 .35 .36
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
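      A hedged sketch of the core idea: register each recorded memory block
      with its own node, rather than the merged per-node span (array names
      assume the x86 SRAT code of this period):

      	for (i = 0; i < num_node_memblks; i++)
      		e820_register_active_regions(memblk_nodeid[i],
      			node_memblk_range[i].start >> PAGE_SHIFT,
      			node_memblk_range[i].end >> PAGE_SHIFT);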
  10. 08 Oct 2010, 1 commit
    • x86: HWPOISON: Report correct address granularity for huge hwpoison faults · f672b49b
      Committed by Andi Kleen
      An earlier patch fixed the hwpoison fault handling to encode the
      huge page size in the fault code of the page fault handler.
      
      This is needed to report this information in SIGBUS to user space.
      
      This is a straightforward patch to pass this information through to
      the signal handling in the x86-specific fault.c.
      
      Cc: x86@kernel.org
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: fengguang.wu@intel.com
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
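      A hedged sketch of how the granularity reaches user space (a
      simplified fragment in the spirit of force_sig_info_fault(); 'fault'
      is the VM_FAULT_* mask from handle_mm_fault(), 'tsk' the faulting
      task):

      	siginfo_t info;
      	unsigned int lsb = 0;

      	info.si_signo = SIGBUS;
      	info.si_errno = 0;
      	info.si_code  = BUS_MCEERR_AR;
      	info.si_addr  = (void __user *)address;
      	/* report the mapping size as the LSB of the poisoned address */
      	if (fault & VM_FAULT_HWPOISON_LARGE)
      		lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
      	else if (fault & VM_FAULT_HWPOISON)
      		lsb = PAGE_SHIFT;
      	info.si_addr_lsb = lsb;
      	force_sig_info(SIGBUS, &info, tsk);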
  11. 06 Oct 2010, 2 commits
  12. 23 Sep 2010, 1 commit
  13. 21 Sep 2010, 1 commit
  14. 05 Sep 2010, 1 commit
  15. 03 Sep 2010, 1 commit
  16. 30 Aug 2010, 1 commit
  17. 28 Aug 2010, 14 commits
  18. 27 Aug 2010, 3 commits
    • x86, mm: Make spurious_fault check explicitly check the PRESENT bit · 660a293e
      Committed by Shaohua Li
      pte_present() returns true even when the present bit is clear, as
      long as the _PAGE_PROTNONE bit (which shares the global bit) is set.
      With CONFIG_DEBUG_PAGEALLOC, free pages have the global bit set but
      the present bit clear. This patch lets us catch accesses to free
      pages when CONFIG_DEBUG_PAGEALLOC is enabled.
      
      [ hpa: added a comment in the code as a warning to janitors ]
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      LKML-Reference: <1280217988.32400.75.camel@sli10-desk.sh.intel.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
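      A hedged sketch of the explicit check (in the spirit of the patched
      spurious_fault_check(); the rest of the function is elided):

      	static int spurious_fault_check(unsigned long error_code, pte_t *pte)
      	{
      		/* Test _PAGE_PRESENT directly: pte_present() also
      		 * returns true for PROTNONE ptes, which would hide
      		 * accesses to DEBUG_PAGEALLOC'd free pages. */
      		if ((pte_flags(*pte) & (_PAGE_PRESENT | _PAGE_PROTNONE)) !=
      		    _PAGE_PRESENT)
      			return 0;
      		/* ... */
      	}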
    • x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping changes · 9b861528
      Committed by Haicheng Li
      When a memory hot-add covers a large enough area that a new PGD entry
      is needed for the direct mapping, the PGDs of other processes do not
      get updated. This leads to CPUs oopsing like below when they access
      the unmapped areas.
      
      [ 1139.243192] BUG: soft lockup - CPU#0 stuck for 61s! [bash:6534]
      [ 1139.243195] Modules linked in: ipv6 autofs4 rfcomm l2cap crc16 bluetooth rfkill binfmt_misc
      dm_mirror dm_region_hash dm_log dm_multipath dm_mod video output sbs sbshc fan battery ac parport_pc
      lp parport joydev usbhid processor thermal thermal_sys container button rtc_cmos rtc_core rtc_lib
      i2c_i801 i2c_core pcspkr uhci_hcd ohci_hcd ehci_hcd usbcore
      [ 1139.243229] irq event stamp: 8538759
      [ 1139.243230] hardirqs last  enabled at (8538759): [<ffffffff8100c3fc>] restore_args+0x0/0x30
      [ 1139.243236] hardirqs last disabled at (8538757): [<ffffffff810422df>] __do_softirq+0x106/0x146
      [ 1139.243240] softirqs last  enabled at (8538758): [<ffffffff81042310>] __do_softirq+0x137/0x146
      [ 1139.243245] softirqs last disabled at (8538743): [<ffffffff8100cb5c>] call_softirq+0x1c/0x34
      [ 1139.243249] CPU 0:
      [ 1139.243250] Modules linked in: ipv6 autofs4 rfcomm l2cap crc16 bluetooth rfkill binfmt_misc
      dm_mirror dm_region_hash dm_log dm_multipath dm_mod video output sbs sbshc fan battery ac parport_pc
      lp parport joydev usbhid processor thermal thermal_sys container button rtc_cmos rtc_core rtc_lib
      i2c_i801 i2c_core pcspkr uhci_hcd ohci_hcd ehci_hcd usbcore
      [ 1139.243284] Pid: 6534, comm: bash Tainted: G   M       2.6.32-haicheng-cpuhp #7 QSSC-S4R
      [ 1139.243287] RIP: 0010:[<ffffffff810ace35>]  [<ffffffff810ace35>] alloc_arraycache+0x35/0x69
      [ 1139.243292] RSP: 0018:ffff8802799f9d78  EFLAGS: 00010286
      [ 1139.243295] RAX: ffff8884ffc00000 RBX: ffff8802799f9d98 RCX: 0000000000000000
      [ 1139.243297] RDX: 0000000000190018 RSI: 0000000000000001 RDI: ffff8884ffc00010
      [ 1139.243300] RBP: ffffffff8100c34e R08: 0000000000000002 R09: 0000000000000000
      [ 1139.243303] R10: ffffffff8246dda0 R11: 000000d08246dda0 R12: ffff8802599bfff0
      [ 1139.243305] R13: ffff88027904c040 R14: ffff8802799f8000 R15: 0000000000000001
      [ 1139.243308] FS:  00007fe81bfe86e0(0000) GS:ffff88000d800000(0000) knlGS:0000000000000000
      [ 1139.243311] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1139.243313] CR2: ffff8884ffc00000 CR3: 000000026cf2d000 CR4: 00000000000006f0
      [ 1139.243316] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1139.243318] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [ 1139.243321] Call Trace:
      [ 1139.243324]  [<ffffffff810ace29>] ? alloc_arraycache+0x29/0x69
      [ 1139.243328]  [<ffffffff8135004e>] ? cpuup_callback+0x1b0/0x32a
      [ 1139.243333]  [<ffffffff8105385d>] ? notifier_call_chain+0x33/0x5b
      [ 1139.243337]  [<ffffffff810538a4>] ? __raw_notifier_call_chain+0x9/0xb
      [ 1139.243340]  [<ffffffff8134ecfc>] ? cpu_up+0xb3/0x152
      [ 1139.243344]  [<ffffffff813388ce>] ? store_online+0x4d/0x75
      [ 1139.243348]  [<ffffffff811e53f3>] ? sysdev_store+0x1b/0x1d
      [ 1139.243351]  [<ffffffff8110589f>] ? sysfs_write_file+0xe5/0x121
      [ 1139.243355]  [<ffffffff810b539d>] ? vfs_write+0xae/0x14a
      [ 1139.243358]  [<ffffffff810b587f>] ? sys_write+0x47/0x6f
      [ 1139.243362]  [<ffffffff8100b9ab>] ? system_call_fastpath+0x16/0x1b
      
      This patch makes sure to always replicate new direct mapping PGD
      entries to the PGDs of all processes, and ensures that the
      corresponding vmemmap mapping gets synced as well.
      
      V1: initial code by Andi Kleen.
      V2: fix several issues found in testing.
      V3: as suggested by Wu Fengguang, reuse common code of vmalloc_sync_all().
      
      [ hpa: changed pgd_change from int to bool ]
      Originally-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
      LKML-Reference: <4C6E4FD8.6080100@linux.intel.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
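      A hedged sketch of the resulting call site (the hot-add path tracks
      whether a new PGD entry was populated, then replicates it everywhere;
      sync_global_pgds() is shown after the next entry):

      	/* in kernel_physical_mapping_init(), after extending the
      	 * direct mapping: */
      	if (pgd_changed)
      		sync_global_pgds(addr, end);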
    • x86, mm: Separate x86_64 vmalloc_sync_all() into separate functions · 6afb5157
      Committed by Haicheng Li
      No behavior change.
      
      Move some of the vmalloc_sync_all() code into a new function,
      sync_global_pgds(), that will be useful for memory hotplug.
      Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
      LKML-Reference: <4C6E4ECD.1090607@linux.intel.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
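      A hedged sketch of the factored-out helper, condensed from the x86-64
      vmalloc_sync_all() logic of this period (error checking trimmed):

      	void sync_global_pgds(unsigned long start, unsigned long end)
      	{
      		unsigned long address;

      		for (address = start; address <= end; address += PGDIR_SIZE) {
      			const pgd_t *pgd_ref = pgd_offset_k(address);
      			unsigned long flags;
      			struct page *page;

      			if (pgd_none(*pgd_ref))
      				continue;

      			/* copy the kernel entry into every process PGD */
      			spin_lock_irqsave(&pgd_lock, flags);
      			list_for_each_entry(page, &pgd_list, lru) {
      				pgd_t *pgd = (pgd_t *)page_address(page)
      					     + pgd_index(address);
      				if (pgd_none(*pgd))
      					set_pgd(pgd, *pgd_ref);
      			}
      			spin_unlock_irqrestore(&pgd_lock, flags);
      		}
      	}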