1. 04 Mar, 2011 1 commit
    • T
      x86-64, NUMA: Revert NUMA affine page table allocation · f8911250
      Tejun Heo authored
      This patch reverts NUMA affine page table allocation added by commit
      1411e0ec (x86-64, numa: Put pgtable to local node memory).
      
      The commit made an undocumented change where the kernel linear mapping
      strictly follows intersection of e820 memory map and NUMA
      configuration.  If the physical memory configuration has holes or NUMA
      nodes are not properly aligned, this leads to using unnecessarily
      smaller mapping size which leads to increased TLB pressure.  For
      details,
      
        http://thread.gmane.org/gmane.linux.kernel/1104672
      
      Patches to fix the problem have been proposed, but the underlying code
      needs more cleanup and the approach itself seems a bit heavy-handed,
      so it has been decided to revert the feature for now and come back
      to it in the next development cycle.
      
        http://thread.gmane.org/gmane.linux.kernel/1105959
      
      As init_memory_mapping_high() callsites have been consolidated since
      the commit, reverting is done manually.  Also, the RED-PEN comment in
      arch/x86/mm/init.c is not restored as the problem no longer exists
      with memblock based top-down early memory allocation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      f8911250
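The TLB-pressure argument in this commit message can be illustrated with a small sketch. Assuming only 2M and 4k mappings (the two sizes visible in the boot logs quoted in later entries of this history), a range can use 2M pages only for the part that is 2M-aligned at both ends; the rest falls back to 4k pages. The names `split_mapping` and `entry_count` and the example node boundary are hypothetical, and this is not the kernel's actual init_memory_mapping() logic.

```python
PAGE_4K = 4 << 10
PAGE_2M = 2 << 20


def split_mapping(start, end):
    """Split [start, end) into (start, end, page_size) chunks.

    Uses 2M pages for the 2M-aligned middle and 4k pages for any
    unaligned head or tail.  Illustrative only.
    """
    chunks = []
    big_start = (start + PAGE_2M - 1) // PAGE_2M * PAGE_2M  # round up
    big_end = end // PAGE_2M * PAGE_2M                      # round down
    if big_start >= big_end:
        # No room for even one 2M page.
        return [(start, end, PAGE_4K)] if start < end else []
    if start < big_start:
        chunks.append((start, big_start, PAGE_4K))
    chunks.append((big_start, big_end, PAGE_2M))
    if big_end < end:
        chunks.append((big_end, end, PAGE_4K))
    return chunks


def entry_count(chunks):
    """Number of page table entries needed to map all chunks."""
    return sum((e - s) // size for s, e, size in chunks)


# One range mapped as a whole versus the same range cut at a
# hypothetical, badly aligned node boundary.
whole = split_mapping(0x100000000, 0x140000000)
cut = (split_mapping(0x100000000, 0x120001000)
       + split_mapping(0x120001000, 0x140000000))
print(entry_count(whole), entry_count(cut))
```

Called as `split_mapping(0, 0x7f750000)`, the sketch yields a 2M chunk up to 0x7f600000 followed by a 4k tail, which is the same split the boot logs show for that range. Cutting a range at an unaligned boundary roughly doubles the number of entries in this example, which is the effect the revert is meant to avoid.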
  2. 24 Feb, 2011 1 commit
    • Y
      x86: Rename e820_table_* to pgt_buf_* · d1b19426
      Yinghai Lu authored
      e820_table_{start|end|top}, which are used to buffer page table
      allocation during early boot, are now derived from memblock and don't
      have much to do with e820.  Change the names so that they reflect what
      they're used for.
      
      This patch doesn't introduce any behavior change.
      
      -v2: Ingo found that earlier patch "x86: Use early pre-allocated page
           table buffer top-down" caused crash on 32bit and needed to be
           dropped.  This patch was updated to reflect the change.
      
      -tj: Updated commit description.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      d1b19426
  3. 16 Feb, 2011 2 commits
    • T
      x86, NUMA: Move *_numa_init() invocations into initmem_init() · d8fc3afc
      Tejun Heo authored
      There's no reason for these to live in setup_arch().  Move them inside
      initmem_init().
      
      - v2: x86-32 initmem_init() wasn't updated, breaking 32bit builds.
        Fixed.  Found by Ankita.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ankita Garg <ankita@in.ibm.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      d8fc3afc
    • T
      x86, NUMA: Drop @start/last_pfn from initmem_init() · 86ef4dbf
      Tejun Heo authored
      initmem_init() extensively accesses and modifies global data
      structures and the parameters aren't even followed depending on which
      path is being used.  Drop @start/last_pfn and let it deal with
      @max_pfn directly.  This is in preparation for further NUMA init
      cleanups.
      
      - v2: x86-32 initmem_init() wasn't updated, breaking 32bit builds.
        Fixed.  Found by Yinghai.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      86ef4dbf
  4. 30 Dec, 2010 2 commits
    • Y
      x86-64, numa: Put pgtable to local node memory · 1411e0ec
      Yinghai Lu authored
      Introduce init_memory_mapping_high(), and use it with 64bit.
      
      It walks every memory segment above 4G and creates the page tables
      for that range inside the range itself.
      
      Before this patch, all page tables were on one node.
      
      With this patch, one RED-PEN is killed.
      
      Debug output for an 8-socket system after the patch:
      [    0.000000] initial memory mapped : 0 - 20000000
      [    0.000000] init_memory_mapping: [0x00000000000000-0x0000007f74ffff]
      [    0.000000]  0000000000 - 007f600000 page 2M
      [    0.000000]  007f600000 - 007f750000 page 4k
      [    0.000000] kernel direct mapping tables up to 7f750000 @ [0x7f74c000-0x7f74ffff]
      [    0.000000] RAMDISK: 7bc84000 - 7f745000
      ....
      [    0.000000] Adding active range (0, 0x10, 0x95) 0 entries of 3200 used
      [    0.000000] Adding active range (0, 0x100, 0x7f750) 1 entries of 3200 used
      [    0.000000] Adding active range (0, 0x100000, 0x1080000) 2 entries of 3200 used
      [    0.000000] Adding active range (1, 0x1080000, 0x2080000) 3 entries of 3200 used
      [    0.000000] Adding active range (2, 0x2080000, 0x3080000) 4 entries of 3200 used
      [    0.000000] Adding active range (3, 0x3080000, 0x4080000) 5 entries of 3200 used
      [    0.000000] Adding active range (4, 0x4080000, 0x5080000) 6 entries of 3200 used
      [    0.000000] Adding active range (5, 0x5080000, 0x6080000) 7 entries of 3200 used
      [    0.000000] Adding active range (6, 0x6080000, 0x7080000) 8 entries of 3200 used
      [    0.000000] Adding active range (7, 0x7080000, 0x8080000) 9 entries of 3200 used
      [    0.000000] init_memory_mapping: [0x00000100000000-0x0000107fffffff]
      [    0.000000]  0100000000 - 1080000000 page 2M
      [    0.000000] kernel direct mapping tables up to 1080000000 @ [0x107ffbd000-0x107fffffff]
      [    0.000000]     memblock_x86_reserve_range: [0x107ffc2000-0x107fffffff]          PGTABLE
      [    0.000000] init_memory_mapping: [0x00001080000000-0x0000207fffffff]
      [    0.000000]  1080000000 - 2080000000 page 2M
      [    0.000000] kernel direct mapping tables up to 2080000000 @ [0x207ff7d000-0x207fffffff]
      [    0.000000]     memblock_x86_reserve_range: [0x207ffc0000-0x207fffffff]          PGTABLE
      [    0.000000] init_memory_mapping: [0x00002080000000-0x0000307fffffff]
      [    0.000000]  2080000000 - 3080000000 page 2M
      [    0.000000] kernel direct mapping tables up to 3080000000 @ [0x307ff3d000-0x307fffffff]
      [    0.000000]     memblock_x86_reserve_range: [0x307ffc0000-0x307fffffff]          PGTABLE
      [    0.000000] init_memory_mapping: [0x00003080000000-0x0000407fffffff]
      [    0.000000]  3080000000 - 4080000000 page 2M
      [    0.000000] kernel direct mapping tables up to 4080000000 @ [0x407fefd000-0x407fffffff]
      [    0.000000]     memblock_x86_reserve_range: [0x407ffc0000-0x407fffffff]          PGTABLE
      [    0.000000] init_memory_mapping: [0x00004080000000-0x0000507fffffff]
      [    0.000000]  4080000000 - 5080000000 page 2M
      [    0.000000] kernel direct mapping tables up to 5080000000 @ [0x507febd000-0x507fffffff]
      [    0.000000]     memblock_x86_reserve_range: [0x507ffc0000-0x507fffffff]          PGTABLE
      [    0.000000] init_memory_mapping: [0x00005080000000-0x0000607fffffff]
      [    0.000000]  5080000000 - 6080000000 page 2M
      [    0.000000] kernel direct mapping tables up to 6080000000 @ [0x607fe7d000-0x607fffffff]
      [    0.000000]     memblock_x86_reserve_range: [0x607ffc0000-0x607fffffff]          PGTABLE
      [    0.000000] init_memory_mapping: [0x00006080000000-0x0000707fffffff]
      [    0.000000]  6080000000 - 7080000000 page 2M
      [    0.000000] kernel direct mapping tables up to 7080000000 @ [0x707fe3d000-0x707fffffff]
      [    0.000000]     memblock_x86_reserve_range: [0x707ffc0000-0x707fffffff]          PGTABLE
      [    0.000000] init_memory_mapping: [0x00007080000000-0x0000807fffffff]
      [    0.000000]  7080000000 - 8080000000 page 2M
      [    0.000000] kernel direct mapping tables up to 8080000000 @ [0x807fdfc000-0x807fffffff]
      [    0.000000]     memblock_x86_reserve_range: [0x807ffbf000-0x807fffffff]          PGTABLE
      [    0.000000] Initmem setup node 0 [0000000000000000-000000107fffffff]
      [    0.000000]   NODE_DATA [0x0000107ffbd000-0x0000107ffc1fff]
      [    0.000000] Initmem setup node 1 [0000001080000000-000000207fffffff]
      [    0.000000]   NODE_DATA [0x0000207ffbb000-0x0000207ffbffff]
      [    0.000000] Initmem setup node 2 [0000002080000000-000000307fffffff]
      [    0.000000]   NODE_DATA [0x0000307ffbb000-0x0000307ffbffff]
      [    0.000000] Initmem setup node 3 [0000003080000000-000000407fffffff]
      [    0.000000]   NODE_DATA [0x0000407ffbb000-0x0000407ffbffff]
      [    0.000000] Initmem setup node 4 [0000004080000000-000000507fffffff]
      [    0.000000]   NODE_DATA [0x0000507ffbb000-0x0000507ffbffff]
      [    0.000000] Initmem setup node 5 [0000005080000000-000000607fffffff]
      [    0.000000]   NODE_DATA [0x0000607ffbb000-0x0000607ffbffff]
      [    0.000000] Initmem setup node 6 [0000006080000000-000000707fffffff]
      [    0.000000]   NODE_DATA [0x0000707ffbb000-0x0000707ffbffff]
      [    0.000000] Initmem setup node 7 [0000007080000000-000000807fffffff]
      [    0.000000]   NODE_DATA [0x0000807ffba000-0x0000807ffbefff]
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4D1933D1.9020609@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      1411e0ec
    • Y
      x86-64, mm: Put early page table high · 4b239f45
      Yinghai Lu authored
      While debugging kdump, I found that the current kernel has a problem with
      crashkernel=512M.
      
      It turns out that the initial mapping covers up to 512M, so the later mapping
      up to 4G (actually 2040M on my platform) puts its page table near 512M, and the
      mapping up to 128G then puts its page table near 2G.
      
      before this patch:
      [    0.000000] initial memory mapped : 0 - 20000000
      [    0.000000] init_memory_mapping: [0x00000000000000-0x0000007f74ffff]
      [    0.000000]  0000000000 - 007f600000 page 2M
      [    0.000000]  007f600000 - 007f750000 page 4k
      [    0.000000] kernel direct mapping tables up to 7f750000 @ [0x1fffc000-0x1fffffff]
      [    0.000000]     memblock_x86_reserve_range: [0x1fffc000-0x1fffdfff]          PGTABLE
      [    0.000000] init_memory_mapping: [0x00000100000000-0x0000207fffffff]
      [    0.000000]  0100000000 - 2080000000 page 2M
      [    0.000000] kernel direct mapping tables up to 2080000000 @ [0x7bc01000-0x7bc83fff]
      [    0.000000]     memblock_x86_reserve_range: [0x7bc01000-0x7bc7efff]          PGTABLE
      [    0.000000] RAMDISK: 7bc84000 - 7f745000
      [    0.000000] crashkernel reservation failed - No suitable area found.
      
      after patch:
      [    0.000000] initial memory mapped : 0 - 20000000
      [    0.000000] init_memory_mapping: [0x00000000000000-0x0000007f74ffff]
      [    0.000000]  0000000000 - 007f600000 page 2M
      [    0.000000]  007f600000 - 007f750000 page 4k
      [    0.000000] kernel direct mapping tables up to 7f750000 @ [0x7f74c000-0x7f74ffff]
      [    0.000000]     memblock_x86_reserve_range: [0x7f74c000-0x7f74dfff]          PGTABLE
      [    0.000000] init_memory_mapping: [0x00000100000000-0x0000207fffffff]
      [    0.000000]  0100000000 - 2080000000 page 2M
      [    0.000000] kernel direct mapping tables up to 2080000000 @ [0x207ff7d000-0x207fffffff]
      [    0.000000]     memblock_x86_reserve_range: [0x207ff7d000-0x207fffafff]          PGTABLE
      [    0.000000] RAMDISK: 7bc84000 - 7f745000
      [    0.000000]     memblock_x86_reserve_range: [0x17000000-0x36ffffff]     CRASH KERNEL
      [    0.000000] Reserving 512MB of memory at 368MB for crashkernel (System RAM: 133120MB)
      
      With the patch, the page table for [0, 2G) is placed near 2G instead of under 512M,
      and the page table for [4G, 128G) is placed near 128G instead of under 2G.
      
      That is good: if we have lots of memory above 4G, such as 1024G, 2048G or 16T, the
      related page tables will not be put under 2G. Otherwise they could fill up the area
      under 2G when 1G or 2M pages are not used.
      
      The code change adds map_low_page() and updates unmap_low_page() for 64bit, and uses
      them to access the corresponding high memory when setting up page tables.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4D0C0734.7060900@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      4b239f45
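The before/after logs in this commit message can be checked with simple arithmetic. The sketch below, with hypothetical helper names, places a page table buffer at the top of a given limit and tests whether it overlaps the area that the patched kernel reserved for the crash kernel. It does not reproduce the kernel's actual search for free memory; it only replays addresses taken from the logs.

```python
PAGE_SIZE = 4096


def place_top_down(limit, size):
    """Return the [start, end) of a page-aligned buffer ending at `limit`."""
    end = limit // PAGE_SIZE * PAGE_SIZE
    start = end - (size + PAGE_SIZE - 1) // PAGE_SIZE * PAGE_SIZE
    return start, end


def overlaps(a, b):
    """True if half-open ranges a and b share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]


# Crash kernel area from the "after patch" log: 512MB at 368MB.
crash = (0x17000000, 0x37000000)

# Before: the buffer is limited by the initial 512M mapping.
before = place_top_down(0x20000000, 0x4000)
# After: the buffer goes to the top of the range being mapped.
after = place_top_down(0x7f750000, 0x4000)

print(hex(before[0]), overlaps(before, crash))
print(hex(after[0]), overlaps(after, crash))
```

The 0x4000 buffer size is read off the logged range [0x7f74c000-0x7f74ffff]. Under these assumptions the "before" buffer lands at 0x1fffc000, inside the 512MB window, while the "after" buffer at 0x7f74c000 stays clear of it.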
  5. 22 Nov, 2010 1 commit
    • L
      x86: Resume trampoline must be executable · 691513f7
      Lin Ming authored
      Commit 5bd5a452 (x86: Add NX protection for kernel data) marked the
      trampoline area NX, which unsurprisingly breaks resume and cpu
      hotplug.
      
      Revert the portion of that commit which touches the trampoline.
      
      Originally-from: Lin Ming <ming.m.lin@intel.com>
      LKML-Reference: <1290410581.2405.24.camel@minggr.sh.intel.com>
      Cc: Matthieu Castet <castet.matthieu@free.fr>
      Cc: Siarhei Liakh <sliakh.lkml@gmail.com>
      Cc: Xuxian Jiang <jiang@cs.ncsu.edu>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Tested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      691513f7
  6. 18 Nov, 2010 1 commit
    • M
      x86: Add NX protection for kernel data · 5bd5a452
      Matthieu Castet authored
      This patch expands functionality of CONFIG_DEBUG_RODATA to set main
      (static) kernel data area as NX.
      
      The following steps are taken to achieve this:
      
       1. Linker script is adjusted so .text always starts and ends on a page boundary
       2. Linker script is adjusted so .rodata always starts and ends on a page boundary
       3. NX is set for all pages from _etext through _end in mark_rodata_ro.
       4. free_init_pages() sets released memory NX in arch/x86/mm/init.c
       5. bios rom is set to x when pcibios is used.
      
      The results of patch application may be observed in the diff of kernel page
      table dumps:
      
      pcibios:
      
       -- data_nx_pt_before.txt       2009-10-13 07:48:59.000000000 -0400
       ++ data_nx_pt_after.txt        2009-10-13 07:26:46.000000000 -0400
        0x00000000-0xc0000000           3G                           pmd
        ---[ Kernel Mapping ]---
       -0xc0000000-0xc0100000           1M     RW             GLB x  pte
       +0xc0000000-0xc00a0000         640K     RW             GLB NX pte
       +0xc00a0000-0xc0100000         384K     RW             GLB x  pte
       -0xc0100000-0xc03d7000        2908K     ro             GLB x  pte
       +0xc0100000-0xc0318000        2144K     ro             GLB x  pte
       +0xc0318000-0xc03d7000         764K     ro             GLB NX pte
       -0xc03d7000-0xc0600000        2212K     RW             GLB x  pte
       +0xc03d7000-0xc0600000        2212K     RW             GLB NX pte
        0xc0600000-0xf7a00000         884M     RW         PSE GLB NX pmd
        0xf7a00000-0xf7bfe000        2040K     RW             GLB NX pte
        0xf7bfe000-0xf7c00000           8K                           pte
      
      No pcibios:
      
       -- data_nx_pt_before.txt       2009-10-13 07:48:59.000000000 -0400
       ++ data_nx_pt_after.txt        2009-10-13 07:26:46.000000000 -0400
        0x00000000-0xc0000000           3G                           pmd
        ---[ Kernel Mapping ]---
       -0xc0000000-0xc0100000           1M     RW             GLB x  pte
       +0xc0000000-0xc0100000           1M     RW             GLB NX pte
       -0xc0100000-0xc03d7000        2908K     ro             GLB x  pte
       +0xc0100000-0xc0318000        2144K     ro             GLB x  pte
       +0xc0318000-0xc03d7000         764K     ro             GLB NX pte
       -0xc03d7000-0xc0600000        2212K     RW             GLB x  pte
       +0xc03d7000-0xc0600000        2212K     RW             GLB NX pte
        0xc0600000-0xf7a00000         884M     RW         PSE GLB NX pmd
        0xf7a00000-0xf7bfe000        2040K     RW             GLB NX pte
        0xf7bfe000-0xf7c00000           8K                           pte
      
      The patch has been originally developed for Linux 2.6.34-rc2 x86 by
      Siarhei Liakh <sliakh.lkml@gmail.com> and Xuxian Jiang <jiang@cs.ncsu.edu>.
      
       -v1:  initial patch for 2.6.30
       -v2:  patch for 2.6.31-rc7
       -v3:  moved all code into arch/x86, adjusted credits
       -v4:  fixed ifdef, removed credits from CREDITS
       -v5:  fixed an address calculation bug in mark_nxdata_nx()
       -v6:  added acked-by and PT dump diff to commit log
       -v7:  minor adjustments for -tip
       -v8:  rework with the merge of "Set first MB as RW+NX"
      Signed-off-by: Siarhei Liakh <sliakh.lkml@gmail.com>
      Signed-off-by: Xuxian Jiang <jiang@cs.ncsu.edu>
      Signed-off-by: Matthieu CASTET <castet.matthieu@free.fr>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: James Morris <jmorris@namei.org>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Kees Cook <kees.cook@canonical.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <4CE2F82E.60601@free.fr>
      [ minor cleanliness edits ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      5bd5a452
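As a rough illustration of steps 1 to 3 in this commit message, the sketch below assigns attributes to page-aligned kernel regions and prints their sizes. The boundary addresses are read off the "No pcibios" dump in the message, and the names `TEXT_START`, `ETEXT`, `RODATA_END` and `DATA_END` are assumptions for illustration rather than kernel symbols known to have these values.

```python
PAGE_SIZE = 4096

# Boundaries taken from the page table dump; names are illustrative.
TEXT_START = 0xC0100000
ETEXT = 0xC0318000       # end of text, start of read-only data
RODATA_END = 0xC03D7000  # end of read-only data, start of data
DATA_END = 0xC0600000


def regions():
    """Return (start, end, writable, executable) per region."""
    return [
        (TEXT_START, ETEXT, False, True),     # text: ro, x
        (ETEXT, RODATA_END, False, False),    # rodata: ro, NX
        (RODATA_END, DATA_END, True, False),  # data: RW, NX
    ]


def describe(region):
    """Format one region in the style of the page table dump."""
    start, end, writable, executable = region
    # NX can only be applied per page, hence the alignment requirement.
    assert start % PAGE_SIZE == 0 and end % PAGE_SIZE == 0
    size_k = (end - start) // 1024
    rw = "RW" if writable else "ro"
    nx = "x" if executable else "NX"
    return f"{start:#010x}-{end:#010x} {size_k}K {rw} {nx}"


for r in regions():
    print(describe(r))
```

The computed sizes (2144K, 764K and 2212K) agree with the "after" lines of the dump, and no region in this model is both writable and executable.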
  7. 28 Oct, 2010 1 commit
  8. 20 Oct, 2010 2 commits
  9. 23 Sep, 2010 1 commit
  10. 03 Sep, 2010 1 commit
  11. 28 Aug, 2010 4 commits
  12. 27 Aug, 2010 2 commits
    • H
      x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping changes · 9b861528
      Haicheng Li authored
      When memory hotplug-adding happens for a large enough area
      that a new PGD entry is needed for the direct mapping, the PGDs
      of other processes would not get updated. This leads to some CPUs
      oopsing like below when they have to access the unmapped areas.
      
      [ 1139.243192] BUG: soft lockup - CPU#0 stuck for 61s! [bash:6534]
      [ 1139.243195] Modules linked in: ipv6 autofs4 rfcomm l2cap crc16 bluetooth rfkill binfmt_misc
      dm_mirror dm_region_hash dm_log dm_multipath dm_mod video output sbs sbshc fan battery ac parport_pc
      lp parport joydev usbhid processor thermal thermal_sys container button rtc_cmos rtc_core rtc_lib
      i2c_i801 i2c_core pcspkr uhci_hcd ohci_hcd ehci_hcd usbcore
      [ 1139.243229] irq event stamp: 8538759
      [ 1139.243230] hardirqs last  enabled at (8538759): [<ffffffff8100c3fc>] restore_args+0x0/0x30
      [ 1139.243236] hardirqs last disabled at (8538757): [<ffffffff810422df>] __do_softirq+0x106/0x146
      [ 1139.243240] softirqs last  enabled at (8538758): [<ffffffff81042310>] __do_softirq+0x137/0x146
      [ 1139.243245] softirqs last disabled at (8538743): [<ffffffff8100cb5c>] call_softirq+0x1c/0x34
      [ 1139.243249] CPU 0:
      [ 1139.243250] Modules linked in: ipv6 autofs4 rfcomm l2cap crc16 bluetooth rfkill binfmt_misc
      dm_mirror dm_region_hash dm_log dm_multipath dm_mod video output sbs sbshc fan battery ac parport_pc
      lp parport joydev usbhid processor thermal thermal_sys container button rtc_cmos rtc_core rtc_lib
      i2c_i801 i2c_core pcspkr uhci_hcd ohci_hcd ehci_hcd usbcore
      [ 1139.243284] Pid: 6534, comm: bash Tainted: G   M       2.6.32-haicheng-cpuhp #7 QSSC-S4R
      [ 1139.243287] RIP: 0010:[<ffffffff810ace35>]  [<ffffffff810ace35>] alloc_arraycache+0x35/0x69
      [ 1139.243292] RSP: 0018:ffff8802799f9d78  EFLAGS: 00010286
      [ 1139.243295] RAX: ffff8884ffc00000 RBX: ffff8802799f9d98 RCX: 0000000000000000
      [ 1139.243297] RDX: 0000000000190018 RSI: 0000000000000001 RDI: ffff8884ffc00010
      [ 1139.243300] RBP: ffffffff8100c34e R08: 0000000000000002 R09: 0000000000000000
      [ 1139.243303] R10: ffffffff8246dda0 R11: 000000d08246dda0 R12: ffff8802599bfff0
      [ 1139.243305] R13: ffff88027904c040 R14: ffff8802799f8000 R15: 0000000000000001
      [ 1139.243308] FS:  00007fe81bfe86e0(0000) GS:ffff88000d800000(0000) knlGS:0000000000000000
      [ 1139.243311] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1139.243313] CR2: ffff8884ffc00000 CR3: 000000026cf2d000 CR4: 00000000000006f0
      [ 1139.243316] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1139.243318] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [ 1139.243321] Call Trace:
      [ 1139.243324]  [<ffffffff810ace29>] ? alloc_arraycache+0x29/0x69
      [ 1139.243328]  [<ffffffff8135004e>] ? cpuup_callback+0x1b0/0x32a
      [ 1139.243333]  [<ffffffff8105385d>] ? notifier_call_chain+0x33/0x5b
      [ 1139.243337]  [<ffffffff810538a4>] ? __raw_notifier_call_chain+0x9/0xb
      [ 1139.243340]  [<ffffffff8134ecfc>] ? cpu_up+0xb3/0x152
      [ 1139.243344]  [<ffffffff813388ce>] ? store_online+0x4d/0x75
      [ 1139.243348]  [<ffffffff811e53f3>] ? sysdev_store+0x1b/0x1d
      [ 1139.243351]  [<ffffffff8110589f>] ? sysfs_write_file+0xe5/0x121
      [ 1139.243355]  [<ffffffff810b539d>] ? vfs_write+0xae/0x14a
      [ 1139.243358]  [<ffffffff810b587f>] ? sys_write+0x47/0x6f
      [ 1139.243362]  [<ffffffff8100b9ab>] ? system_call_fastpath+0x16/0x1b
      
      This patch makes sure to always replicate new direct mapping PGD entries
      to the PGDs of all processes, as well as ensures corresponding vmemmap
      mapping gets synced.
      
      V1: initial code by Andi Kleen.
      V2: fix several issues found in testing.
      V3: as suggested by Wu Fengguang, reuse common code of vmalloc_sync_all().
      
      [ hpa: changed pgd_change from int to bool ]
      Originally-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
      LKML-Reference: <4C6E4FD8.6080100@linux.intel.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      9b861528
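The failure mode described in this commit message can be modelled in a few lines. The sketch below assumes x86-64 4-level paging, where bits 47..39 of a virtual address select the top-level (PGD) entry, and represents each page table as a plain dict. The function `sync_pgds` is a hypothetical stand-in for the idea of replicating new entries into every process's PGD; it is not the kernel's sync_global_pgds() implementation.

```python
PGDIR_SHIFT = 39      # x86-64 4-level paging: each PGD entry covers 512G
PTRS_PER_PGD = 512


def pgd_index(addr):
    """Index of the top-level (PGD) entry that covers `addr`."""
    return (addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1)


def sync_pgds(reference, others, start, end):
    """Copy PGD entries covering [start, end] from `reference` to `others`.

    Each page table is modelled as a dict {pgd_index: entry}.
    Returns the number of entries copied.
    """
    copied = 0
    for idx in range(pgd_index(start), pgd_index(end) + 1):
        if idx not in reference:
            continue
        for pgd in others:
            if idx not in pgd:
                pgd[idx] = reference[idx]
                copied += 1
    return copied


# Addresses from the oops in the commit message: RSP lies in an
# already mapped area, CR2 is the address that could not be accessed.
old_idx = pgd_index(0xFFFF8802799F9D78)
new_idx = pgd_index(0xFFFF8884FFC00000)

kernel_pgd = {old_idx: "pud-A", new_idx: "pud-B"}  # updated by hot-add
process_pgds = [{old_idx: "pud-A"}, {old_idx: "pud-A"}]

print(old_idx, new_idx)
print(sync_pgds(kernel_pgd, process_pgds, 0xFFFF8884FFC00000, 0xFFFF8884FFC00000))
```

In this model the stack address and the faulting address fall into adjacent PGD slots, which is consistent with the message's explanation that hot-added memory needed a new PGD entry that other processes did not yet have.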
    • H
      x86, mm: Separate x86_64 vmalloc_sync_all() into separate functions · 6afb5157
      Haicheng Li authored
      No behavior change.
      
      Move some of vmalloc_sync_all() code into a new function
      sync_global_pgds() that will be useful for memory hotplug.
      Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
      LKML-Reference: <4C6E4ECD.1090607@linux.intel.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      6afb5157
  13. 19 Jul, 2010 1 commit
  14. 30 Mar, 2010 1 commit
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo authored
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include
      those headers directly instead of assuming availability.  As this
      conversion needs to touch a large number of source files, the
      following script is used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include such that its order conforms
        to its surroundings.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable on most builds of
      the specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
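The first task the commit message lists for the script (deciding which header a file needs) might look roughly like the sketch below. This is not the slabh-sweep.py script referenced in the message; the regular expressions and the function names `needed_header` and `missing_include` are simplified assumptions that cover only a few well-known identifiers.

```python
import re

# A few slab and gfp identifiers; a real list would be much longer.
SLAB_RE = re.compile(r"\b(kmalloc|kzalloc|kcalloc|kfree|kmem_cache_\w+)\s*\(")
GFP_RE = re.compile(r"\b(GFP_[A-Z]+|__GFP_[A-Z]+|gfp_t)\b")


def needed_header(source):
    """Return the header a C source text needs, or None.

    slab.h already includes gfp.h, so slab usage takes precedence.
    """
    if SLAB_RE.search(source):
        return "linux/slab.h"
    if GFP_RE.search(source):
        return "linux/gfp.h"
    return None


def missing_include(source):
    """Return the needed header if the source does not include it yet."""
    header = needed_header(source)
    if header and f"#include <{header}>" not in source:
        return header
    return None


example = """
#include <linux/module.h>
static void *grab(void) { return kmalloc(64, GFP_KERNEL); }
"""
print(missing_include(example))
```

The sketch deliberately leaves out the harder parts the message describes, such as choosing where to insert the new include so that it matches the ordering of the surrounding block.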
  15. 13 Feb, 2010 2 commits
    • Y
      sparsemem: Put mem map for one node together. · 9bdac914
      Yinghai Lu authored
      Add vmemmap_alloc_block_buf for mem map only.
      
      It will fall back to the old way if it cannot get a block that big.
      
      Before this patch, when a node has 128G of RAM installed, its memmap
      is split into two or more parts.
      [    0.000000]  [ffffea0000000000-ffffea003fffffff] PMD -> [ffff880100600000-ffff88013e9fffff] on node 1
      [    0.000000]  [ffffea0040000000-ffffea006fffffff] PMD -> [ffff88013ec00000-ffff88016ebfffff] on node 1
      [    0.000000]  [ffffea0070000000-ffffea007fffffff] PMD -> [ffff882000600000-ffff8820105fffff] on node 0
      [    0.000000]  [ffffea0080000000-ffffea00bfffffff] PMD -> [ffff882010800000-ffff8820507fffff] on node 0
      [    0.000000]  [ffffea00c0000000-ffffea00dfffffff] PMD -> [ffff882050a00000-ffff8820709fffff] on node 0
      [    0.000000]  [ffffea00e0000000-ffffea00ffffffff] PMD -> [ffff884000600000-ffff8840205fffff] on node 2
      [    0.000000]  [ffffea0100000000-ffffea013fffffff] PMD -> [ffff884020800000-ffff8840607fffff] on node 2
      [    0.000000]  [ffffea0140000000-ffffea014fffffff] PMD -> [ffff884060a00000-ffff8840709fffff] on node 2
      [    0.000000]  [ffffea0150000000-ffffea017fffffff] PMD -> [ffff886000600000-ffff8860305fffff] on node 3
      [    0.000000]  [ffffea0180000000-ffffea01bfffffff] PMD -> [ffff886030800000-ffff8860707fffff] on node 3
      [    0.000000]  [ffffea01c0000000-ffffea01ffffffff] PMD -> [ffff888000600000-ffff8880405fffff] on node 4
      [    0.000000]  [ffffea0200000000-ffffea022fffffff] PMD -> [ffff888040800000-ffff8880707fffff] on node 4
      [    0.000000]  [ffffea0230000000-ffffea023fffffff] PMD -> [ffff88a000600000-ffff88a0105fffff] on node 5
      [    0.000000]  [ffffea0240000000-ffffea027fffffff] PMD -> [ffff88a010800000-ffff88a0507fffff] on node 5
      [    0.000000]  [ffffea0280000000-ffffea029fffffff] PMD -> [ffff88a050a00000-ffff88a0709fffff] on node 5
      [    0.000000]  [ffffea02a0000000-ffffea02bfffffff] PMD -> [ffff88c000600000-ffff88c0205fffff] on node 6
      [    0.000000]  [ffffea02c0000000-ffffea02ffffffff] PMD -> [ffff88c020800000-ffff88c0607fffff] on node 6
      [    0.000000]  [ffffea0300000000-ffffea030fffffff] PMD -> [ffff88c060a00000-ffff88c0709fffff] on node 6
      [    0.000000]  [ffffea0310000000-ffffea033fffffff] PMD -> [ffff88e000600000-ffff88e0305fffff] on node 7
      [    0.000000]  [ffffea0340000000-ffffea037fffffff] PMD -> [ffff88e030800000-ffff88e0707fffff] on node 7
      
      After the patch:
      [    0.000000]  [ffffea0000000000-ffffea006fffffff] PMD -> [ffff880100200000-ffff88016e5fffff] on node 0
      [    0.000000]  [ffffea0070000000-ffffea00dfffffff] PMD -> [ffff882000200000-ffff8820701fffff] on node 1
      [    0.000000]  [ffffea00e0000000-ffffea014fffffff] PMD -> [ffff884000200000-ffff8840701fffff] on node 2
      [    0.000000]  [ffffea0150000000-ffffea01bfffffff] PMD -> [ffff886000200000-ffff8860701fffff] on node 3
      [    0.000000]  [ffffea01c0000000-ffffea022fffffff] PMD -> [ffff888000200000-ffff8880701fffff] on node 4
      [    0.000000]  [ffffea0230000000-ffffea029fffffff] PMD -> [ffff88a000200000-ffff88a0701fffff] on node 5
      [    0.000000]  [ffffea02a0000000-ffffea030fffffff] PMD -> [ffff88c000200000-ffff88c0701fffff] on node 6
      [    0.000000]  [ffffea0310000000-ffffea037fffffff] PMD -> [ffff88e000200000-ffff88e0701fffff] on node 7
      
      -v2: change buf to vmemmap_buf instead according to Ingo
           also add CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER according to Ingo
      -v3: according to Andrew, use sizeof(name) instead of hard coded 15
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <1265793639-15071-19-git-send-email-yinghai@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      9bdac914
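The per-node memmap size in these logs can be reproduced with a short calculation, and the idea of allocating one large buffer and carving blocks out of it can be sketched as a bump allocator. The 56-byte struct page size is an assumption inferred from the log (it makes a 128G node come out at exactly 0x70000000 bytes of memmap), and the `NodeBuf` class is a hypothetical model, not the kernel's vmemmap_alloc_block_buf().

```python
PAGE_SIZE = 4096
STRUCT_PAGE_SIZE = 56  # assumed; inferred from the log, not from the source


def memmap_bytes(ram_bytes):
    """Bytes of struct page needed to describe `ram_bytes` of RAM."""
    return ram_bytes // PAGE_SIZE * STRUCT_PAGE_SIZE


class NodeBuf:
    """One contiguous per-node buffer handed out in blocks."""

    def __init__(self, base, size):
        self.next = base
        self.end = base + size

    def alloc(self, size):
        """Return a block address, or None so the caller can fall back."""
        if self.next + size > self.end:
            return None
        addr = self.next
        self.next += size
        return addr


node_ram = 128 << 30
size = memmap_bytes(node_ram)
# Base address of node 1's memmap in the "after patch" log.
buf = NodeBuf(0xFFFF882000200000, size)
first = buf.alloc(2 << 20)   # 2M blocks, matching the PMD mappings
second = buf.alloc(2 << 20)
print(hex(size), hex(first), hex(second))
```

Because every block comes from the same buffer, consecutive allocations are adjacent, which is why the "after" log shows a single contiguous range per node. Returning `None` when the buffer is exhausted mirrors the fallback to the old allocation path mentioned in the message.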
    • Y
      x86: Make 64 bit use early_res instead of bootmem before slab · 08677214
      Yinghai Lu authored
      x86_64 can now finally use early_res in place of bootmem.
      
      CONFIG_NO_BOOTMEM can still be used to enable or disable this.
      
      -v2: fix 32-bit compilation with respect to MAX_DMA32_PFN
      -v3: folded bug fix from LKML message below
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4B747239.4070907@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      08677214
  16. 11 Feb, 2010 1 commit
  17. 03 Feb, 2010 1 commit
  18. 03 Nov, 2009 2 commits
  19. 20 Oct, 2009 2 commits
    • S
      x86-64: align RODATA kernel section to 2MB with CONFIG_DEBUG_RODATA · 74e08179
      Suresh Siddha committed
      CONFIG_DEBUG_RODATA chops the large pages spanning boundaries of kernel
      text/rodata/data to small 4KB pages as they are mapped with different
      attributes (text as RO, RODATA as RO and NX etc).
      
      On x86_64, preserve the large page mappings for kernel text/rodata/data
      boundaries when CONFIG_DEBUG_RODATA is enabled. This is done by allowing the
      RODATA section to be hugepage aligned and having the same RWX attributes
      for the 2MB page boundaries.
      
      Extra memory pages padding the sections will be freed at the end of boot,
      and the kernel identity mappings will have different RWX permissions compared to
      the kernel text mappings.
      
      Kernel identity mappings to these physical pages will be mapped with smaller
      pages, but large page mappings are still retained for the kernel text, rodata,
      and data mappings.
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      LKML-Reference: <20091014220254.190119924@sbs-t61.sc.intel.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      74e08179
    • S
      x86-64: preserve large page mapping for 1st 2MB kernel txt with CONFIG_DEBUG_RODATA · b9af7c0d
      Suresh Siddha committed
      In the first 2MB, kernel text is co-located with the kernel static
      page tables set up by head_64.S.  CONFIG_DEBUG_RODATA chops this
      2MB large page mapping into small 4KB pages as we mark the kernel text as RO,
      leaving the static page tables as RW.
      
      With CONFIG_DEBUG_RODATA disabled, an OLTP run on NHM-EP shows a 1% improvement
      with a 2% reduction in system time and a 1% improvement in iowait idle time.
      
      To recover this, move the kernel static page tables to .data section, so that
      we don't have to break the first 2MB of kernel text to small pages with
      CONFIG_DEBUG_RODATA.
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      LKML-Reference: <20091014220254.063193621@sbs-t61.sc.intel.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      b9af7c0d
  20. 13 Oct, 2009 1 commit
    • D
      x86: Export k8 physical topology · 8ee2debc
      David Rientjes committed
      To eventually interleave emulated nodes over physical nodes, we
      need to know the physical topology of the machine without actually
      registering it.  This does the k8 node setup in two parts:
      detection and registration.  NUMA emulation can then use the
      physical topology detected to set up the address ranges of emulated
      nodes accordingly.  If emulation isn't used, the k8 nodes are
      registered as normal.
      
      Two formals are added to the x86 NUMA setup functions: `acpi' and
      `k8'. These represent whether ACPI or K8 NUMA has been detected;
      both cannot be true at the same time.  This specifies to the NUMA
      emulation code whether an underlying physical NUMA topology exists
      and which interface to use.
      
      This patch deals solely with separating the k8 setup path into
      Northbridge detection and registration steps and leaves the ACPI
      changes for a subsequent patch.  The `acpi' formal is added here,
      however, to avoid touching all the header files again in the next
      patch.
      
      This approach also ensures emulated nodes will not span physical
      nodes so the true memory latency is not misrepresented.
      
      k8_get_nodes() may now be used to export the k8 physical topology
      of the machine for NUMA emulation.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Ankita Garg <ankita@in.ibm.com>
      Cc: Len Brown <len.brown@intel.com>
      LKML-Reference: <alpine.DEB.1.00.0909251518400.14754@chino.kir.corp.google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8ee2debc
  21. 23 Sep, 2009 5 commits
  22. 22 Sep, 2009 1 commit
  23. 25 Aug, 2009 1 commit
  24. 21 Aug, 2009 1 commit
  25. 09 Jul, 2009 1 commit
  26. 01 Jul, 2009 1 commit
    • Y
      x86: only clear node_states for 64bit · 66918dcd
      Yinghai Lu committed
      Nathan reported that
      
      | commit 73d60b7f
      | Author: Yinghai Lu <yinghai@kernel.org>
      | Date:   Tue Jun 16 15:33:00 2009 -0700
      |
      |    page-allocator: clear N_HIGH_MEMORY map before we set it again
      |
      |    SRAT tables may contains nodes of very small size.  The arch code may
      |    decide to not activate such a node.  However, currently the early boot
      |    code sets N_HIGH_MEMORY for such nodes.  These nodes therefore seem to be
      |    active although these nodes have no present pages.
      |
      |    For 64bit N_HIGH_MEMORY == N_NORMAL_MEMORY, so that works for 64 bit too
      
      unintentionally and incorrectly clears the cpuset.mems cgroup attribute on
      an i386 kvm guest, meaning that cpuset.mems cannot be used.
      
      Fix this by clearing node_states[N_NORMAL_MEMORY] on 64-bit only.
      This also requires saving and restoring it in find_zone_movable_pfn.
      Reported-by: Nathan Lynch <ntl@pobox.com>
      Tested-by: Nathan Lynch <ntl@pobox.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66918dcd