1. 17 Nov 2009 (1 commit)
    • x86, pageattr: Make set_memory_(x|nx) aware of NX support · 583140af
      Committed by H. Peter Anvin
      Make set_memory_x/set_memory_nx directly aware of whether NX is
      supported in the system, rather than requiring every caller to assess
      that support independently.
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tim Starling <tstarling@wikimedia.org>
      Cc: Hannes Eder <hannes@hanneseder.net>
      LKML-Reference: <1258154897-6770-4-git-send-email-hpa@zytor.com>
      Acked-by: Kees Cook <kees.cook@canonical.com>
      583140af
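
      A minimal sketch of the idea (not the exact patch; the internal
      change_page_attr_clear()/change_page_attr_set() calls are shown with
      simplified signatures): the helpers become no-ops for the NX bit when
      the kernel does not support NX, so callers no longer have to check
      __supported_pte_mask themselves.

          int set_memory_x(unsigned long addr, int numpages)
          {
              /* Nothing to clear if NX is not supported on this system. */
              if (!(__supported_pte_mask & _PAGE_NX))
                  return 0;

              return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_NX));
          }

          int set_memory_nx(unsigned long addr, int numpages)
          {
              /* Silently succeed rather than set an unsupported bit. */
              if (!(__supported_pte_mask & _PAGE_NX))
                  return 0;

              return change_page_attr_set(&addr, numpages, __pgprot(_PAGE_NX));
          }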
  2. 03 Nov 2009 (2 commits)
  3. 28 Oct 2009 (1 commit)
    • tracing: allow to change permissions for text with dynamic ftrace enabled · 883242dd
      Committed by Steven Rostedt
      Commit 74e08179 ("x86-64: align RODATA kernel section to 2MB with
      CONFIG_DEBUG_RODATA") prevents text sections from becoming read/write
      using set_memory_rw.
      
      The dynamic ftrace changes all text pages to read/write just before
      converting the calls to tracing to nops, and vice versa.
      
      I originally just added a flag to allow this transaction when ftrace did
      the change, but I also found that when the CPA testing was running it
      would remove the read/write as well. Since ftrace does not do the text
      conversion on boot up, the CPA changes caused the dynamic tracer to fail
      its self tests.
      
      The current solution is to simply not prevent change_page_attr from
      setting the RW bit for kernel text pages.
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      883242dd
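
      A rough sketch of where the fix lands (symbol names taken from the
      surrounding CONFIG_DEBUG_RODATA code in pageattr.c; the exact guard is
      approximate): the static_protections() block that forbids RW on the
      2MB-aligned kernel text mapping is compiled out when dynamic ftrace is
      enabled, so change_page_attr can still flip text pages to read/write.

          #if defined(CONFIG_X86_64) && defined(CONFIG_DEBUG_RODATA) && \
              !defined(CONFIG_DYNAMIC_FTRACE)
              /*
               * Keep the large-page-aligned kernel text read-only, unless
               * dynamic ftrace needs to patch the text at runtime.
               */
              if (within(address, (unsigned long)_text,
                         (unsigned long)__end_rodata_hpage_align))
                  pgprot_val(forbidden) |= _PAGE_RW;
          #endif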
  4. 20 Oct 2009 (1 commit)
    • x86-64: align RODATA kernel section to 2MB with CONFIG_DEBUG_RODATA · 74e08179
      Committed by Suresh Siddha
      CONFIG_DEBUG_RODATA chops the large pages spanning the boundaries of
      kernel text/rodata/data into small 4KB pages, as they are mapped with
      different attributes (text as RO, rodata as RO and NX, etc.).
      
      On x86_64, preserve the large page mappings for kernel text/rodata/data
      boundaries when CONFIG_DEBUG_RODATA is enabled. This is done by allowing
      the RODATA section to be hugepage aligned and by keeping the same RWX
      attributes across the 2MB page boundaries.
      
      Extra memory pages padding the sections will be freed at the end of
      boot, and the kernel identity mappings will have different RWX
      permissions compared to the kernel text mappings.
      
      Kernel identity mappings to these physical pages will use smaller pages,
      but large page mappings are still retained for the kernel
      text/rodata/data mappings.
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      LKML-Reference: <20091014220254.190119924@sbs-t61.sc.intel.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      74e08179
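
      A loose sketch of the freeing side (following the shape of
      mark_rodata_ro() on x86_64; the section symbols and helper below are
      approximations, not the exact patch): pages that only exist to pad the
      sections out to 2MB boundaries are handed back to the page allocator at
      the end of boot.

          static void __init free_alignment_padding(void)
          {
              unsigned long text_end     = PAGE_ALIGN((unsigned long)&__stop___ex_table);
              unsigned long rodata_start = PAGE_ALIGN((unsigned long)&__start_rodata);
              unsigned long rodata_end   = PAGE_ALIGN((unsigned long)&__end_rodata);
              unsigned long data_start   = (unsigned long)&_sdata;

              /* Padding between the end of text and the 2MB-aligned rodata. */
              free_init_pages("unused kernel memory", text_end, rodata_start);
              /* Padding between the end of rodata and the start of data. */
              free_init_pages("unused kernel memory", rodata_end, data_start);
          }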
  5. 12 Sep 2009 (1 commit)
    • agp/intel: Fix the pre-9xx chipset flush. · e517a5e9
      Committed by Eric Anholt
      Ever since we enabled GEM, the pre-9xx chipsets (particularly 865) have had
      serious stability issues.  Back in May a wbinvd was added to the DRM to
      work around much of the problem.  Some failure remained -- easily visible
      by dragging a window around on an X -retro desktop, or by looking at bugzilla.
      
      The chipset flush was on the right track -- hitting the right amount of
      memory, and it appears to be the only way to flush on these chipsets, but the
      flush page was mapped uncached.  As a result, the writes trying to clear the
      writeback cache ended up bypassing the cache, and not flushing anything!  The
      wbinvd would flush out other writeback data and often cause the data we wanted
      to get flushed, but not always.  By removing the setting of the page to UC
      and instead just clflushing the data we write to try to flush it, we get the
      desired behavior with no wbinvd.
      
      This exports clflush_cache_range(), which was lying around and happened
      to basically match the code I was otherwise going to copy from the DRM.
      Signed-off-by: Eric Anholt <eric@anholt.net>
      Signed-off-by: Brice Goglin <Brice.Goglin@ens-lyon.org>
      Cc: stable@kernel.org
      e517a5e9
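
      A rough sketch of the approach (not the exact driver change; flush_page
      stands in for the driver's chipset flush page): dirty the cacheable
      flush page and clflush exactly that range, instead of mapping the page
      uncached.

          #include <linux/string.h>
          #include <asm/cacheflush.h>    /* clflush_cache_range(), exported by this patch */

          static void i8xx_chipset_flush(void *flush_page)
          {
              /* Dirty the flush page through its normal, cacheable mapping... */
              memset(flush_page, 0, 1024);
              /* ...then push those cache lines out to memory. */
              clflush_cache_range(flush_page, 1024);
          }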
  6. 10 Sep 2009 (1 commit)
  7. 14 Aug 2009 (1 commit)
  8. 04 Aug 2009 (1 commit)
  9. 31 Jul 2009 (1 commit)
    • x86, pat: Fix set_memory_wc related corruption · bdc6340f
      Committed by Pallipadi, Venkatesh
      Changeset 3869c4aa, which went in after 2.6.30-rc1, was a seemingly small
      change to _set_memory_wc() to make it compliant with SDM requirements.
      But it introduced a nasty bug, which can result in crashes and/or strange
      corruptions when set_memory_wc is used. One such crash is reported here:
      http://lkml.org/lkml/2009/7/30/94
      
      Actually, that changeset introduced two bugs.
      * change_page_attr_set() takes &addr as its first argument, and the addr
        value might have changed on return, even for a single-page
        change_page_attr_set() call. That makes the second
        change_page_attr_set() in this routine operate on an unrelated addr,
        which can eventually cause strange corruptions and a bad page state
        crash.
      * The second change_page_attr_set() call, before setting _PAGE_CACHE_WC,
        should clear the earlier _PAGE_CACHE_UC_MINUS, as otherwise the cache
        attribute will not be WC (it will be UC instead).
      
      The patch below fixes both these problems. A single patch fixes both, as
      the change is to the same line of code. The change to have an addr_copy
      is not very clean, but it is simpler than making more changes through
      various routines in pageattr.c.
      
      A huge thanks to Jerome for reporting this problem and providing a simple test
      case that helped us root cause the problem.
      Reported-by: Jerome Glisse <glisse@freedesktop.org>
      Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      LKML-Reference: <20090730214319.GA1889@linux-os.sc.intel.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      bdc6340f
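
      A sketch of the fixed _set_memory_wc() (shape only; the internal
      change_page_attr_set()/change_page_attr_set_clr() signatures are
      simplified): the first call may modify addr, so the second call works on
      a saved copy and clears the stale UC_MINUS attribute before setting WC.

          int _set_memory_wc(unsigned long addr, int numpages)
          {
              unsigned long addr_copy = addr;    /* first call may change addr */
              int ret;

              ret = change_page_attr_set(&addr, numpages,
                                         __pgprot(_PAGE_CACHE_UC_MINUS));
              if (!ret) {
                  /* Clear the old cache attribute, then set WC. */
                  ret = change_page_attr_set_clr(&addr_copy, numpages,
                                                 __pgprot(_PAGE_CACHE_WC),
                                                 __pgprot(_PAGE_CACHE_MASK),
                                                 0, 0, NULL);
              }
              return ret;
          }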
  10. 04 Jul 2009 (1 commit)
    • x86,percpu: generalize lpage first chunk allocator · 8c4bfc6e
      Committed by Tejun Heo
      Generalize and move x86 setup_pcpu_lpage() into
      pcpu_lpage_first_chunk().  setup_pcpu_lpage() now is a simple wrapper
      around the generalized version.  Other than taking size parameters and
      using arch supplied callbacks to allocate/free/map memory,
      pcpu_lpage_first_chunk() is identical to the original implementation.
      
      This simplifies arch code and will help convert more archs to the
      dynamic percpu allocator.
      
      While at it, factor out pcpu_calc_fc_sizes() which is common to
      pcpu_embed_first_chunk() and pcpu_lpage_first_chunk().
      
      [ Impact: code reorganization and generalization ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      8c4bfc6e
  11. 22 Jun 2009 (2 commits)
    • x86: fix pageattr handling for lpage percpu allocator and re-enable it · e59a1bb2
      Committed by Tejun Heo
      The lpage allocator aliases a PMD page for each cpu and returns whatever
      is unused to the page allocator. When the pageattr of the recycled pages
      is changed, the two aliases point to overlapping regions with different
      attributes, which isn't allowed and is known to cause subtle data
      corruption in certain cases.
      
      This can be handled in a similar manner to the x86_64 highmap alias: the
      pageattr code should detect whether the target pages have a PMD alias
      and, if so, split the PMD alias and synchronize the attributes.
      
      The pcpur allocator is updated to keep the map of allocated PMD pages
      sorted in ascending address order and to provide a pcpu_lpage_remapped()
      function, which binary-searches the array to determine whether a given
      address is aliased and, if so, to which address. pageattr is updated to
      use pcpu_lpage_remapped() to detect the PMD alias and split it up as
      necessary from cpa_process_alias().
      
      Jan Beulich spotted the original problem and incorrect usage of vaddr
      instead of laddr for lookup.
      
      With this, lpage percpu allocator should work correctly.  Re-enable
      it.
      
      [ Impact: fix subtle lpage pageattr bug and re-enable lpage ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jan Beulich <JBeulich@novell.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      e59a1bb2
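
      A simplified, hypothetical sketch of the alias lookup idea (the struct,
      field and function names here are illustrative, not the kernel's):
      binary-search a sorted array of recycled PMD pages and return the percpu
      alias for a linear kernel address, or NULL if it is not aliased.

          struct lpage_map {
              void *linear;    /* linear-map address of the PMD page */
              void *alias;     /* percpu remap alias of the same page */
          };

          static struct lpage_map *lpage_maps;    /* sorted by ->linear */
          static int nr_lpage_maps;

          void *lpage_remapped(void *kaddr)
          {
              void *pmd_page = (void *)((unsigned long)kaddr & PMD_MASK);
              unsigned long offset = (unsigned long)kaddr & ~PMD_MASK;
              int lo = 0, hi = nr_lpage_maps - 1;

              while (lo <= hi) {
                  int mid = lo + (hi - lo) / 2;

                  if (lpage_maps[mid].linear < pmd_page)
                      lo = mid + 1;
                  else if (lpage_maps[mid].linear > pmd_page)
                      hi = mid - 1;
                  else
                      return lpage_maps[mid].alias + offset;
              }
              return NULL;    /* not an aliased PMD page */
          }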
    • x86: reorganize cpa_process_alias() · 992f4c1c
      Committed by Tejun Heo
      Reorganize cpa_process_alias() so that new alias condition can be
      added easily.
      
      Jan Beulich spotted a problem in the original cleanup thread, which
      incorrectly assumed the two existing conditions were mutually exclusive.
      
      [ Impact: code reorganization ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      992f4c1c
  12. 15 Jun 2009 (1 commit)
  13. 27 May 2009 (1 commit)
  14. 23 May 2009 (2 commits)
  15. 10 Apr 2009 (3 commits)
  16. 30 Mar 2009 (1 commit)
  17. 20 Mar 2009 (3 commits)
  18. 15 Mar 2009 (1 commit)
    • x86: add brk allocation for very, very early allocations · 93dbda7c
      Committed by Jeremy Fitzhardinge
      Impact: new interface
      
      Add a brk()-like allocator which effectively extends the bss in order
      to allow very early code to do dynamic allocations.  This is better than
      using statically allocated arrays for data in subsystems which may never
      get used.
      
      The space for brk allocations is in the bss ELF segment, so that the
      space is mapped properly by the code which maps the kernel, and so
      that bootloaders keep the space free rather than putting a ramdisk or
      something into it.
      
      The bss itself, delimited by __bss_stop, ends before the brk area
      (__brk_base to __brk_limit).  The kernel text, data and bss are reserved
      up to __bss_stop.
      
      Any brk-allocated data is reserved separately just before the kernel
      pagetable is built, as that code allocates from unreserved spaces
      in the e820 map, potentially allocating from any unused brk memory.
      Ultimately any unused memory in the brk area is used in the general
      kernel memory pool.
      
      Initially the brk space is set to 1MB, which is probably much larger
      than any user needs (the largest current user is i386 head_32.S's code
      to build the pagetables to map the kernel, which can get fairly large
      with a big kernel image and no PSE support).  So long as the system
      has sufficient memory for the bootloader to reserve the kernel+1MB brk,
      there are no bad effects resulting from an over-large brk.
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      93dbda7c
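
      A sketch of the allocator itself, close to the extend_brk() added by the
      patch (the exact checks are approximate): a simple bump allocator
      between _brk_end and __brk_limit.

          void * __init extend_brk(size_t size, size_t align)
          {
              size_t mask = align - 1;
              void *ret;

              BUG_ON(_brk_start == 0);    /* too late, brk already finalized */
              BUG_ON(align & mask);       /* alignment must be a power of two */

              _brk_end = (_brk_end + mask) & ~mask;           /* align up */
              BUG_ON((char *)(_brk_end + size) > __brk_limit);

              ret = (void *)_brk_end;
              _brk_end += size;

              memset(ret, 0, size);
              return ret;
          }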
  19. 12 Mar 2009 (1 commit)
  20. 21 Feb 2009 (1 commit)
    • x86, pat: add large-PAT check to split_large_page() · 7a5714e0
      Committed by Ingo Molnar
      Impact: future-proof the split_large_page() function
      
      Linus noticed that split_large_page() is not safe wrt. the
      PAT bit: it is bit 12 on the 1GB and 2MB page table level
      (_PAGE_BIT_PAT_LARGE), and it is bit 7 on the 4K page
      table level (_PAGE_BIT_PAT).
      
      Currently it is not a problem because we never set
      _PAGE_BIT_PAT_LARGE on any of the large-page mappings - but should this
      happen in the future, split_large_page() would silently lift bit 12 into
      the low-level 4K pte and start corrupting the physical page frame
      offset. Not fun.
      
      So add a debug warning, to make sure if something ever sets
      the PAT bit then this function gets updated too.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      7a5714e0
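
      A minimal sketch of the check (written as a hypothetical standalone
      helper; in the patch it is a warning inside split_large_page() on the
      protections inherited from the large entry):

          /*
           * The PAT bit is bit 12 in 2MB/1GB entries (_PAGE_PAT_LARGE) but
           * bit 7 in 4K PTEs (_PAGE_PAT).  Copying bit 12 unchanged into the
           * 4K PTEs would corrupt the page frame offset, so warn if a large
           * mapping ever carries it.
           */
          static inline void check_large_pat(pgprot_t ref_prot)
          {
              WARN_ON_ONCE(pgprot_val(ref_prot) & _PAGE_PAT_LARGE);
          }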
  21. 20 Feb 2009 (1 commit)
    • x86: use the right protections for split-up pagetables · 07a66d7c
      Committed by Ingo Molnar
      Steven Rostedt found a bug where, in his modified kernel, ftrace was
      unable to modify the kernel text because the PMD itself had been marked
      read-only as well in split_large_page().
      
      The fix, suggested by Linus, is to not try to 'clone' the
      reference protection of a huge-page, but to use the standard
      (and permissive) page protection bits of KERNPG_TABLE.
      
      The 'cloning' makes sense for the ptes but it's a confused and
      incorrect concept at the page table level - because the
      pagetable entry is a set of all ptes and hence cannot
      'clone' any single protection attribute - the ptes can be any
      mixture of protections.
      
      With the permissive KERNPG_TABLE, even if the pte protections
      get changed after this point (due to ftrace doing code-patching
      or other similar activities like kprobes), the resulting combined
      protections will still be correct and the pte's restrictive
      (or permissive) protections will control it.
      
      Also update the comment.
      
      This bug was there for a long time but has not caused visible
      problems before as it needs a rather large read-only area to
      trigger. Steve possibly hacked his kernel with some really
      large arrays or so. Anyway, the bug is definitely worth fixing.
      
      [ Huang Ying also experienced problems in this area when writing
        the EFI code, but the real bug in split_large_page() was not
        realized back then. ]
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Reported-by: Huang Ying <ying.huang@intel.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      07a66d7c
  22. 13 Feb 2009 (1 commit)
  23. 12 Feb 2009 (1 commit)
  24. 21 Jan 2009 (1 commit)
    • x86: fix page attribute corruption with cpa() · a1e46212
      Committed by Suresh Siddha
      Impact: fix sporadic slowdowns and warning messages
      
      This patch fixes a performance issue reported by Linus on his
      Nehalem system. While Linus reverted the PAT patch (commit
      58dab916) which exposed the issue,
      existing cpa() code can potentially still cause wrong (page attribute
      corruption) behavior.
      
      This patch also fixes the "WARNING: at arch/x86/mm/pageattr.c:560" that
      various people reported.
      
      In 64bit kernel, kernel identity mapping might have holes depending
      on the available memory and how e820 reports the address range
      covering the RAM, ACPI, PCI reserved regions. If there is a 2MB/1GB hole
      in the address range that is not listed by e820 entries, kernel identity
      mapping will have a corresponding hole in its 1-1 identity mapping.
      
      If cpa() happens on the kernel identity mapping which falls into these holes,
      existing code fails like this:
      
      	__change_page_attr_set_clr()
      		__change_page_attr()
      			returns 0 because of if (!kpte). But doesn't
      			set cpa->numpages and cpa->pfn.
      		cpa_process_alias()
      			uses uninitialized cpa->pfn (random value)
      			which can potentially lead to changing the page
      			attribute of kernel text/data, kernel identity
      			mapping of RAM pages etc. oops!
      
      This bug was easily exposed by another PAT patch which was doing cpa()
      more often on kernel identity mapping holes (the physical range between
      max_low_pfn_mapped and 4GB), where it was setting the cache disable
      attribute (PCD) for kernel identity mappings as well.
      
      Fix cpa() to handle the kernel identity mapping holes. Retain the WARN()
      for cpa() calls to other non-present address ranges (kernel text/data,
      ioremap() addresses).
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      a1e46212
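
      A sketch of the shape of the fix (function and field names are
      approximate, not the exact patch): when the primary, identity-map path
      hits a hole, populate the cpa state with sane values instead of leaving
      an uninitialized pfn; everything else still warns and fails.

          static int __cpa_process_fault(struct cpa_data *cpa, unsigned long vaddr,
                                         int primary)
          {
              if (!primary)
                  return 0;

              /*
               * A hole in the kernel identity mapping (an e820 gap) is not an
               * error: account one page and continue with a valid pfn.
               */
              if (within(vaddr, PAGE_OFFSET,
                         PAGE_OFFSET + (max_pfn_mapped << PAGE_SHIFT))) {
                  cpa->numpages = 1;
                  cpa->pfn = __pa(vaddr) >> PAGE_SHIFT;
                  return 0;
              }

              /* Kernel text/data and ioremap() ranges are still a bug. */
              WARN(1, KERN_WARNING "CPA: called for zero pte. vaddr = %lx\n",
                   vaddr);
              return -EFAULT;
          }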
  25. 16 Jan 2009 (1 commit)
  26. 14 Jan 2009 (1 commit)
    • x86 PAT: remove CPA WARN_ON for zero pte · 58dab916
      Committed by venkatesh.pallipadi@intel.com
      Impact: reduce scope of debug check - avoid warnings
      
      The logic to find whether an identity mapping exists, using high_memory
      or max_low_pfn_mapped/max_pfn_mapped, is not complete, as the memory
      within the range may not be mapped if there is an unusable hole in e820.
      
      Specifically, on my test system I started seeing these warnings with
      tools like hwinfo, acpidump trying to map ACPI region.
      
      [   27.400018] ------------[ cut here ]------------
      [   27.400344] WARNING: at /home/venkip/src/linus/linux-2.6/arch/x86/mm/pageattr.c:560 __change_page_attr_set_clr+0xf3/0x8b8()
      [   27.400821] Hardware name: X7DB8
      [   27.401070] CPA: called for zero pte. vaddr = ffff8800cff6a000 cpa->vaddr = ffff8800cff6a000
      [   27.401569] Modules linked in:
      [   27.401882] Pid: 4913, comm: dmidecode Not tainted 2.6.28-05716-gfe0bdec6 #586
      [   27.402141] Call Trace:
      [   27.402488]  [<ffffffff80237c21>] warn_slowpath+0xd3/0x10f
      [   27.402749]  [<ffffffff80274ade>] ? find_get_page+0xb3/0xc9
      [   27.403028]  [<ffffffff80274a2b>] ? find_get_page+0x0/0xc9
      [   27.403333]  [<ffffffff80226425>] __change_page_attr_set_clr+0xf3/0x8b8
      [   27.403628]  [<ffffffff8028ec99>] ? __purge_vmap_area_lazy+0x192/0x1a1
      [   27.403883]  [<ffffffff8028eb52>] ? __purge_vmap_area_lazy+0x4b/0x1a1
      [   27.404172]  [<ffffffff80290268>] ? vm_unmap_aliases+0x1ab/0x1bb
      [   27.404512]  [<ffffffff80290105>] ? vm_unmap_aliases+0x48/0x1bb
      [   27.404766]  [<ffffffff80226d28>] change_page_attr_set_clr+0x13e/0x2e6
      [   27.405026]  [<ffffffff80698fa7>] ? _spin_unlock+0x26/0x2a
      [   27.405292]  [<ffffffff80227e6a>] ? reserve_memtype+0x19b/0x4e3
      [   27.405590]  [<ffffffff80226ffd>] _set_memory_wb+0x22/0x24
      [   27.405844]  [<ffffffff80225d28>] ioremap_change_attr+0x26/0x28
      [   27.406097]  [<ffffffff80228355>] reserve_pfn_range+0x1a3/0x235
      [   27.406427]  [<ffffffff80228430>] track_pfn_vma_new+0x49/0xb3
      [   27.406686]  [<ffffffff80286c46>] remap_pfn_range+0x94/0x32c
      [   27.406940]  [<ffffffff8022878d>] ? phys_mem_access_prot_allowed+0xb5/0x1a8
      [   27.407209]  [<ffffffff803e9bf4>] mmap_mem+0x75/0x9d
      [   27.407523]  [<ffffffff8028b3b4>] mmap_region+0x2cf/0x53e
      [   27.407776]  [<ffffffff8028b8cc>] do_mmap_pgoff+0x2a9/0x30d
      [   27.408034]  [<ffffffff8020f4a4>] sys_mmap+0x92/0xce
      [   27.408339]  [<ffffffff8020b65b>] system_call_fastpath+0x16/0x1b
      [   27.408614] ---[ end trace 4b16ad70c09a602d ]---
      [   27.408871] dmidecode:4913 reserve_pfn_range ioremap_change_attr failed write-back for cff6a000-cff6b000
      
      This is with track_pfn_vma_new trying to keep the identity map in sync.
      The address cff6a000 is the ACPI region according to e820.
      
      [    0.000000] BIOS-provided physical RAM map:
      [    0.000000]  BIOS-e820: 0000000000000000 - 000000000009c000 (usable)
      [    0.000000]  BIOS-e820: 000000000009c000 - 00000000000a0000 (reserved)
      [    0.000000]  BIOS-e820: 00000000000cc000 - 00000000000d0000 (reserved)
      [    0.000000]  BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved)
      [    0.000000]  BIOS-e820: 0000000000100000 - 00000000cff60000 (usable)
      [    0.000000]  BIOS-e820: 00000000cff60000 - 00000000cff69000 (ACPI data)
      [    0.000000]  BIOS-e820: 00000000cff69000 - 00000000cff80000 (ACPI NVS)
      [    0.000000]  BIOS-e820: 00000000cff80000 - 00000000d0000000 (reserved)
      [    0.000000]  BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
      [    0.000000]  BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
      [    0.000000]  BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
      [    0.000000]  BIOS-e820: 00000000ff000000 - 0000000100000000 (reserved)
      [    0.000000]  BIOS-e820: 0000000100000000 - 0000000230000000 (usable)
      
      And it is not mapped, as per init_memory_mapping:
      
      [    0.000000] init_memory_mapping: 0000000000000000-00000000cff60000
      [    0.000000] init_memory_mapping: 0000000100000000-0000000230000000
      
      We can add logic to check for this. But there can also be other holes in
      the identity map when we have 1GB of aligned reserved space in e820.
      
      This patch handles it by removing the WARN_ON and returning a specific
      error value (EFAULT) to indicate that the address does not have any
      identity mapping.
      
      The code that tries to keep the identity map in sync can ignore this
      error, while other callers of cpa still get the error here.
      Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      58dab916
  27. 06 Nov 2008 (1 commit)
  28. 23 Oct 2008 (1 commit)
  29. 20 Oct 2008 (1 commit)
    • mm: rewrite vmap layer · db64fe02
      Committed by Nick Piggin
      Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
      provide a fast, scalable percpu frontend for small vmaps (requires a
      slightly different API, though).
      
      The biggest problem with vmap is actually vunmap.  Presently this requires
      a global kernel TLB flush, which on most architectures is a broadcast IPI
      to all CPUs to flush the cache.  This is all done under a global lock.  As
      the number of CPUs increases, so will the number of vunmaps a scaled
      workload will want to perform, and so will the cost of a global TLB flush.
       This gives terrible quadratic scalability characteristics.
      
      Another problem is that the entire vmap subsystem works under a single
      lock.  It is a rwlock, but it is actually taken for write in all the fast
      paths, and the read locking would likely never be run concurrently anyway,
      so it's just pointless.
      
      This is a rewrite of vmap subsystem to solve those problems.  The existing
      vmalloc API is implemented on top of the rewritten subsystem.
      
      The TLB flushing problem is solved by using lazy TLB unmapping.  vmap
      addresses do not have to be flushed immediately when they are vunmapped,
      because the kernel will not reuse them again (would be a use-after-free)
      until they are reallocated.  So the addresses aren't allocated again until
      a subsequent TLB flush.  A single TLB flush then can flush multiple
      vunmaps from each CPU.
      
      XEN and PAT and such do not like deferred TLB flushing because they can't
      always handle multiple aliasing virtual addresses to a physical address.
      They now call vm_unmap_aliases() in order to flush any deferred mappings.
      That call is very expensive (well, actually not a lot more expensive than
      a single vunmap under the old scheme), however it should be OK if not
      called too often.
      
      The virtual memory extent information is stored in an rbtree rather than a
      linked list to improve the algorithmic scalability.
      
      There is a per-CPU allocator for small vmaps, which amortizes or avoids
      global locking.
      
      To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
      must be used in place of vmap and vunmap.  Vmalloc does not use these
      interfaces at the moment, so it will not be quite so scalable (although it
      will use lazy TLB flushing).
      
      As a quick test of performance, I ran a test that loops in the kernel,
      linearly mapping then touching then unmapping 4 pages.  Different numbers
      of tests were run in parallel on a 4 core, 2 socket opteron.  Results are
      in nanoseconds per map+touch+unmap.
      
      threads           vanilla         vmap rewrite
      1                 14700           2900
      2                 33600           3000
      4                 49500           2800
      8                 70631           2900
      
      So with 8 cores, the rewritten version is already 25x faster.
      
      In a slightly more realistic test (although with an older and less
      scalable version of the patch), I ripped the not-very-good vunmap batching
      code out of XFS, and implemented the large buffer mapping with vm_map_ram
      and vm_unmap_ram...  along with a couple of other tricks, I was able to
      speed up a large directory workload by 20x on a 64 CPU system.  I believe
      vmap/vunmap is actually sped up a lot more than 20x on such a system, but
      I'm running into other locks now.  vmap is pretty well blown off the
      profiles.
      
      Before:
      1352059 total                                      0.1401
      798784 _write_lock                              8320.6667 <- vmlist_lock
      529313 default_idle                             1181.5022
       15242 smp_call_function                         15.8771  <- vmap tlb flushing
        2472 __get_vm_area_node                         1.9312  <- vmap
        1762 remove_vm_area                             4.5885  <- vunmap
         316 map_vm_area                                0.2297  <- vmap
         312 kfree                                      0.1950
         300 _spin_lock                                 3.1250
         252 sn_send_IPI_phys                           0.4375  <- tlb flushing
         238 vmap                                       0.8264  <- vmap
         216 find_lock_page                             0.5192
         196 find_next_bit                              0.3603
         136 sn2_send_IPI                               0.2024
         130 pio_phys_write_mmr                         2.0312
         118 unmap_kernel_range                         0.1229
      
      After:
       78406 total                                      0.0081
       40053 default_idle                              89.4040
       33576 ia64_spinlock_contention                 349.7500
        1650 _spin_lock                                17.1875
         319 __reg_op                                   0.5538
         281 _atomic_dec_and_lock                       1.0977
         153 mutex_unlock                               1.5938
         123 iget_locked                                0.1671
         117 xfs_dir_lookup                             0.1662
         117 dput                                       0.1406
         114 xfs_iget_core                              0.0268
          92 xfs_da_hashname                            0.1917
          75 d_alloc                                    0.0670
          68 vmap_page_range                            0.0462 <- vmap
          58 kmem_cache_alloc                           0.0604
          57 memset                                     0.0540
          52 rb_next                                    0.1625
          50 __copy_user                                0.0208
          49 bitmap_find_free_region                    0.2188 <- vmap
          46 ia64_sn_udelay                             0.1106
          45 find_inode_fast                            0.1406
          42 memcmp                                     0.2188
          42 finish_task_switch                         0.1094
          42 __d_lookup                                 0.0410
          40 radix_tree_lookup_slot                     0.1250
          37 _spin_unlock_irqrestore                    0.3854
          36 xfs_bmapi                                  0.0050
          36 kmem_cache_free                            0.0256
          35 xfs_vn_getattr                             0.0322
          34 radix_tree_lookup                          0.1062
          33 __link_path_walk                           0.0035
          31 xfs_da_do_buf                              0.0091
          30 _xfs_buf_find                              0.0204
          28 find_get_page                              0.0875
          27 xfs_iread                                  0.0241
          27 __strncpy_from_user                        0.2812
          26 _xfs_buf_initialize                        0.0406
          24 _xfs_buf_lookup_pages                      0.0179
          24 vunmap_page_range                          0.0250 <- vunmap
          23 find_lock_page                             0.0799
          22 vm_map_ram                                 0.0087 <- vmap
          20 kfree                                      0.0125
          19 put_page                                   0.0330
          18 __kmalloc                                  0.0176
          17 xfs_da_node_lookup_int                     0.0086
          17 _read_lock                                 0.0885
          17 page_waitqueue                             0.0664
      
      vmap has gone from being the top 5 on the profiles and flushing the crap
      out of all TLBs, to using less than 1% of kernel time.
      
      [akpm@linux-foundation.org: cleanups, section fix]
      [akpm@linux-foundation.org: fix build on alpha]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      db64fe02
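
      A brief usage sketch of the new per-CPU fast path (the helper names are
      illustrative; error handling trimmed): small mappings go through
      vm_map_ram()/vm_unmap_ram() instead of vmap()/vunmap(), and the TLB
      flush on unmap is deferred and batched.

          #include <linux/mm.h>
          #include <linux/vmalloc.h>

          static void *map_small_buffer(struct page **pages, unsigned int nr_pages)
          {
              /* node = -1 lets the allocator pick; normal kernel protections. */
              return vm_map_ram(pages, nr_pages, -1, PAGE_KERNEL);
          }

          static void unmap_small_buffer(void *addr, unsigned int nr_pages)
          {
              /* Unmap now; the TLB flush happens lazily, in batches. */
              vm_unmap_ram(addr, nr_pages);
          }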
  30. 11 Oct 2008 (3 commits)
    • x86, cpa: srlz cpa(), global flush tlb after splitting big page and before doing cpa · ad5ca55f
      Committed by Suresh Siddha
      Do a global TLB flush after splitting the large page and before we do
      the actual page attribute change in the PTE.
      
      Without this, we violate the TLB application note, which says:
          "The TLBs may contain both ordinary and large-page translations for
           a 4-KByte range of linear addresses. This may occur if software
           modifies the paging structures so that the page size used for the
           address range changes. If the two translations differ with respect
           to page frame or attributes (e.g., permissions), processor behavior
           is undefined and may be implementation-specific."
      
      Also serialize cpa() (for !DEBUG_PAGEALLOC, which uses large identity
      mappings) using cpa_lock, so that we don't allow another cpu with stale
      large tlb entries to change the page attribute in parallel with some
      other cpu splitting a large page entry and changing the attribute.
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: arjan@linux.intel.com
      Cc: venkatesh.pallipadi@intel.com
      Cc: jeremy@goop.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ad5ca55f
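
      A sketch of the serialization (the helper names are hypothetical; in the
      patch the lock/unlock calls sit directly around the lookup-and-change
      path in __change_page_attr_set_clr()):

          static DEFINE_SPINLOCK(cpa_lock);

          /*
           * Under DEBUG_PAGEALLOC, cpa() works on 4K identity mappings and can
           * be called from sensitive contexts, so skip the lock there
           * (debug_pagealloc is the pageattr.c-local flag for that config).
           */
          static void cpa_serialize_lock(void)
          {
              if (!debug_pagealloc)
                  spin_lock(&cpa_lock);
          }

          static void cpa_serialize_unlock(void)
          {
              if (!debug_pagealloc)
                  spin_unlock(&cpa_lock);
          }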
    • x86, cpa: remove cpa pool code · 8311eb84
      Committed by Suresh Siddha
      Interrupt context no longer splits large pages in cpa(), so we can do
      away with the cpa memory pool code.
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: arjan@linux.intel.com
      Cc: venkatesh.pallipadi@intel.com
      Cc: jeremy@goop.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8311eb84
    • x86, cpa: no need to check alias for __set_pages_p/__set_pages_np · 55121b43
      Committed by Suresh Siddha
      No alias checking needed for setting present/not-present mapping. Otherwise,
      we may need to break large pages for 64-bit kernel text mappings (this adds to
      complexity if we want to do this from atomic context especially, for ex:
      with CONFIG_DEBUG_PAGEALLOC). Let's keep it simple!
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: arjan@linux.intel.com
      Cc: venkatesh.pallipadi@intel.com
      Cc: jeremy@goop.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      55121b43
  31. 30 Sep 2008 (1 commit)