1. 11 Nov 2011, 6 commits
  2. 03 Nov 2011, 2 commits
    • thp: share get_huge_page_tail() · b35a35b5
      Committed by Andrea Arcangeli
      This avoids duplicating the function in every arch's gup_fast.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b35a35b5
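      A minimal sketch of what the shared helper looks like, based on this
      commit's description (illustrative, not a verbatim copy of the patch):

        /* Shared tail-page pin for the arch gup_fast paths: tail page
         * references are accounted in ->_mapcount, never ->_count. */
        static inline void get_huge_page_tail(struct page *page)
        {
                /* __split_huge_page_refcount() cannot run from under us */
                VM_BUG_ON(page_mapcount(page) < 0);
                VM_BUG_ON(atomic_read(&page->_count) != 0);
                atomic_inc(&page->_mapcount);
        }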
    • mm: thp: tail page refcounting fix · 70b50f94
      Committed by Andrea Arcangeli
      Michel, while working on the working set estimation code, noticed
      that calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
      wasn't safe if the pfn ended up being a tail page of a transparent
      hugepage under splitting by __split_huge_page_refcount().
      
      He then found that the problem could also theoretically materialize with
      page_cache_get_speculative() during the speculative radix tree lookups
      that use get_page_unless_zero() on SMP, if the radix tree page is freed
      and reallocated and get_user_pages is called on it before
      page_cache_get_speculative has a chance to call get_page_unless_zero().
      
      So the best way to fix the problem is to keep page_tail->_count zero at
      all times.  This will guarantee that get_page_unless_zero() can never
      succeed on any tail page.  page_tail->_mapcount is guaranteed zero and
      is unused for all tail pages of a compound page, so we can simply
      account the tail page references there and transfer them to
      tail_page->_count in __split_huge_page_refcount() (in addition to the
      head_page->_mapcount).
      
      While debugging this s/_count/_mapcount/ change I also noticed that
      get_page is called by direct-io.c on pages returned by get_user_pages.
      That wasn't entirely safe because the two atomic_inc in get_page weren't
      atomic.  By contrast, other get_user_pages users, like the secondary-MMU
      page fault establishing the shadow pagetables, would never call any
      superfluous get_page after get_user_pages returns.  It's safer to make
      get_page universally safe for tail pages and to use get_page_foll()
      within follow_page (inside get_user_pages()).  get_page_foll() is safe
      to do the refcounting for tail pages without taking any locks because
      it is run within PT lock protected critical sections (PT lock for pte
      and page_table_lock for pmd_trans_huge).
      
      The standard get_page() as invoked by direct-io will instead now take
      the compound_lock, but still only for tail pages.  The direct-io paths
      are usually I/O bound and the compound_lock is per THP, so it is very
      fine-grained; there's no risk of scalability issues with it.  A simple
      direct-io benchmark with all lockdep prove-locking and spinlock
      debugging infrastructure enabled shows identical performance and no
      overhead.  So it's worth it.  Ideally direct-io should stop calling
      get_page() on pages returned by get_user_pages().  The spinlock in
      get_page() is already optimized away for no-THP builds, but doing
      get_page() on tail pages returned by GUP is generally a rare operation
      and usually only run in I/O paths.
      
      This new refcounting on page_tail->_mapcount, in addition to avoiding
      new RCU critical sections, will also allow the working set estimation
      code to work without any further complexity associated with the tail
      page refcounting with THP.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Michel Lespinasse <walken@google.com>
      Reviewed-by: Michel Lespinasse <walken@google.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: <stable@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      70b50f94
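      A minimal sketch of the scheme described above (illustrative, loosely
      following the commit's description): inside get_user_pages() the PT
      lock is already held, so tail references can go straight into
      ->_mapcount with no extra locking, and the tail's ->_count stays zero:

        /* Lockless-looking but PT-lock protected: the PT lock keeps
         * __split_huge_page_refcount() from running under us. */
        static inline void get_page_foll(struct page *page)
        {
                if (unlikely(PageTail(page))) {
                        /* pin the head via ->_count, account the tail
                         * reference in ->_mapcount so the tail's
                         * ->_count can stay zero at all times */
                        atomic_inc(&page->first_page->_count);
                        atomic_inc(&page->_mapcount);
                } else {
                        VM_BUG_ON(atomic_read(&page->_count) <= 0);
                        atomic_inc(&page->_count);
                }
        }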
  3. 24 Oct 2011, 1 commit
    • x86: Fix S4 regression · 8548c84d
      Committed by Takashi Iwai
      Commit 4b239f45 ("x86-64, mm: Put early page table high") causes an S4
      regression since 2.6.39, namely the machine occasionally reboots at S4
      resume.  It doesn't always happen; the overall rate is about 1 in 20.
      But, like other bugs, once it happens, it continues to happen.
      
      This patch fixes the problem by essentially reverting to the older way
      of assigning the memory.
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Cc: <stable@kernel.org>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Yinghai Lu <yinghai.lu@oracle.com>
      [ We'll hopefully find the real fix, but that's too late for 3.1 now ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8548c84d
  4. 29 Sep 2011, 1 commit
    • x86-64: Don't apply destructive erratum workaround on unaffected CPUs · e05139f2
      Committed by Jan Beulich
      Erratum 93 applies to AMD K8 CPUs only, and its workaround
      (forcing the upper 32 bits of %rip to all get set under certain
      conditions) actually gets in the way of analyzing page
      faults occurring during EFI physical mode runtime calls (in
      particular, the page table walk shown is completely unrelated to
      the actual fault).  This is because EFI runtime code typically
      lives in the space between 2G and 4G, which - modulo the above
      manipulation - is likely to overlap with the kernel or modules
      area.
      
      While the other errata workarounds could likewise be limited to take
      effect only on the affected CPUs, none of them appears to be
      destructive, and they are generally called only outside of
      performance-critical paths, so they are being left untouched.
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Link: http://lkml.kernel.org/r/4E835FE30200007800058464@nat28.tlf.novell.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e05139f2
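      A condensed sketch of the gating this commit describes (illustrative;
      the symbols mirror the fault-path code of that era rather than quoting
      the patch verbatim):

        static int is_errata93(struct pt_regs *regs, unsigned long address)
        {
        #if defined(CONFIG_X86_64) && defined(CONFIG_CPU_SUP_AMD)
                /* Erratum 93 applies to AMD K8 (family 0xf) only */
                if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD ||
                    boot_cpu_data.x86 != 0xf)
                        return 0;
                if (address != regs->ip || (address >> 32) != 0)
                        return 0;
                /* retry with the upper 32 bits of %rip forced to ones,
                 * but only if that lands in kernel text or modules */
                address |= 0xffffffffUL << 32;
                if ((address >= (u64)_stext && address <= (u64)_etext) ||
                    (address >= MODULES_VADDR && address <= MODULES_END)) {
                        regs->ip = address;
                        return 1;
                }
        #endif
                return 0;
        }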
  5. 15 Sep 2011, 1 commit
  6. 23 Aug 2011, 1 commit
  7. 16 Aug 2011, 2 commits
  8. 11 Aug 2011, 1 commit
  9. 07 Aug 2011, 1 commit
  10. 06 Aug 2011, 1 commit
    • x86, amd: Avoid cache aliasing penalties on AMD family 15h · dfb09f9b
      Committed by Borislav Petkov
      This patch provides performance tuning for the "Bulldozer" CPU. With its
      shared instruction cache there is a chance of generating an excessive
      number of cache cross-invalidates when running specific workloads on the
      cores of a compute module.
      
      This excessive amount of cross-invalidations can be observed if cache
      lines backed by shared physical memory alias in bits [14:12] of their
      virtual addresses, as those bits are used for the index generation.
      
      This patch addresses the issue by clearing all the bits in the [14:12]
      slice of the file mapping's virtual address at generation time, thus
      forcing those bits to be the same for all mappings of a single shared
      library across processes and, in doing so, avoiding instruction cache
      aliasing.
      
      It also adds the command line option "align_va_addr=(32|64|on|off)" with
      which virtual address alignment can be enabled for 32-bit or 64-bit x86
      individually, or both, or be completely disabled.
      
      This change leaves virtual region address allocation on other families
      and/or vendors unaffected.
      Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
      Link: http://lkml.kernel.org/r/1312550110-24160-2-git-send-email-bp@amd64.org
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      dfb09f9b
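      The alignment itself is cheap to sketch: mmap addresses are already
      page aligned, so rounding a candidate up to a 32 KiB boundary forces
      bits [14:12] to zero for every mapping (an illustrative helper, not
      the commit's actual arch_get_unmapped_area() hookup):

        #define VA_ALIGN_MASK   ((1UL << 15) - 1)   /* covers bits [14:0] */

        /* Round a page-aligned mmap candidate up so that bits [14:12]
         * are identical (zero) across all processes mapping the file. */
        static unsigned long bulldozer_align_addr(unsigned long addr)
        {
                return (addr + VA_ALIGN_MASK) & ~VA_ALIGN_MASK;
        }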
  11. 05 Aug 2011, 1 commit
  12. 27 Jul 2011, 1 commit
  13. 13 Jul 2011, 2 commits
    • x86, numa: Implement pfn -> nid mapping granularity check · 1e01979c
      Committed by Tejun Heo
      SPARSEMEM w/o VMEMMAP and DISCONTIGMEM, both used only on 32bit, use a
      sections array to map pfn to nid, which is limited in granularity.  If
      NUMA nodes are laid out such that the mapping cannot be accurate, boot
      will fail, triggering BUG_ON() in mminit_verify_page_links().
      
      On 32bit, the granularity is 512MiB w/ PAE and SPARSEMEM.  This seems
      to have been granular enough until commit 2706a0bf (x86, NUMA: Enable
      CONFIG_AMD_NUMA on 32bit too).  Apparently there is a machine which
      aligns NUMA nodes to 128MiB and has only AMD NUMA but no SRAT.  This
      led to the following BUG_ON():
      
       On node 0 totalpages: 2096615
         DMA zone: 32 pages used for memmap
         DMA zone: 0 pages reserved
         DMA zone: 3927 pages, LIFO batch:0
         Normal zone: 1740 pages used for memmap
         Normal zone: 220978 pages, LIFO batch:31
         HighMem zone: 16405 pages used for memmap
         HighMem zone: 1853533 pages, LIFO batch:31
       BUG: Int 6: CR2   (null)
            EDI   (null)  ESI 00000002  EBP 00000002  ESP c1543ecc
            EBX f2400000  EDX 00000006  ECX   (null)  EAX 00000001
            err   (null)  EIP c16209aa   CS 00000060  flg 00010002
       Stack: f2400000 00220000 f7200800 c1620613 00220000 01000000 04400000 00238000
                (null) f7200000 00000002 f7200b58 f7200800 c1620929 000375fe   (null)
              f7200b80 c16395f0 00200a02 f7200a80   (null) 000375fe 00000002   (null)
       Pid: 0, comm: swapper Not tainted 2.6.39-rc5-00181-g2706a0bf #17
       Call Trace:
        [<c136b1e5>] ? early_fault+0x2e/0x2e
        [<c16209aa>] ? mminit_verify_page_links+0x12/0x42
        [<c1620613>] ? memmap_init_zone+0xaf/0x10c
        [<c1620929>] ? free_area_init_node+0x2b9/0x2e3
        [<c1607e99>] ? free_area_init_nodes+0x3f2/0x451
        [<c1601d80>] ? paging_init+0x112/0x118
        [<c15f578d>] ? setup_arch+0x791/0x82f
        [<c15f43d9>] ? start_kernel+0x6a/0x257
      
      This patch implements node_map_pfn_alignment(), which determines the
      maximum internode alignment, and updates numa_register_memblks() to
      reject the NUMA configuration if the alignment exceeds the pfn -> nid
      mapping granularity of the memory model as determined by
      PAGES_PER_SECTION.
      
      This makes the problematic machine boot w/ flatmem by rejecting the
      NUMA config, and provides protection against crazy NUMA configurations.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20110712074534.GB2872@htj.dyndns.org
      LKML-Reference: <20110628174613.GP478@escobedo.osrc.amd.com>
      Reported-and-Tested-by: Hans Rosenfeld <hans.rosenfeld@amd.com>
      Cc: Conny Seidel <conny.seidel@amd.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      1e01979c
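      The heart of the check can be sketched as follows (condensed from the
      commit; treat the details as illustrative): for every internode
      boundary, find the coarsest power-of-two mask that still separates
      the two nodes, accumulate those masks, and compare the result against
      PAGES_PER_SECTION:

        /* Returns the maximum internode alignment in pages; if it is
         * larger than what section-granular pfn -> nid mapping can
         * express, the NUMA config must be rejected. */
        unsigned long __init node_map_pfn_alignment(void)
        {
                unsigned long accl_mask = 0, last_end = 0;
                int last_nid = -1, i;

                for_each_active_range_index_in_nid(i, MAX_NUMNODES) {
                        int nid = early_node_map[i].nid;
                        unsigned long start = early_node_map[i].start_pfn;
                        unsigned long end = early_node_map[i].end_pfn;
                        unsigned long mask;

                        if (!start || last_nid < 0 || last_nid == nid) {
                                last_nid = nid;
                                last_end = end;
                                continue;
                        }

                        /* start fine, coarsen until the mask can no
                         * longer tell this node from the previous one */
                        mask = ~((1 << __ffs(start)) - 1);
                        while (mask && last_end <= (start & (mask << 1)))
                                mask <<= 1;

                        accl_mask |= mask;
                }

                return ~accl_mask + 1;  /* mask -> number of pages */
        }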
    • x86, mm: s/PAGES_PER_ELEMENT/PAGES_PER_SECTION/ · d0ead157
      Committed by Tejun Heo
      DISCONTIGMEM on x86-32 implements pfn -> nid mapping similarly to
      SPARSEMEM; however, it calls each mapping unit ELEMENT instead of
      SECTION.  This patch renames it to SECTION so that PAGES_PER_SECTION
      is valid for both DISCONTIGMEM and SPARSEMEM.  This will be used by
      the next patch to implement the mapping granularity check.
      
      This patch is a trivial constant rename.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20110712074422.GA2872@htj.dyndns.org
      Cc: Hans Rosenfeld <hans.rosenfeld@amd.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      d0ead157
  14. 12 Jul 2011, 1 commit
  15. 01 Jul 2011, 1 commit
    • perf: Remove the nmi parameter from the swevent and overflow interface · a8b0ca17
      Committed by Peter Zijlstra
      The nmi parameter indicated whether we could do wakeups from the current
      context; if not, we would set some state and self-IPI, and let the
      resulting interrupt do the wakeup.
      
      For the various event classes:
      
        - hardware: nmi=0; PMI is in fact an NMI or we run irq_work_run from
          the PMI-tail (ARM etc.)
        - tracepoint: nmi=0; since tracepoint could be from NMI context.
        - software: nmi=[0,1]; some, like the schedule thing, cannot
          perform wakeups and hence need 0.
      
      As one can see, there is very little nmi=1 usage, and the downside of
      not using it is that on some platforms some software events can have a
      jiffy delay in wakeup (when arch_irq_work_raise isn't implemented).
      The upside, however, is that we can remove the nmi parameter and save
      a bunch of conditionals in fast paths.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Michael Cree <mcree@orcon.net.nz>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Link: http://lkml.kernel.org/n/tip-agjev8eu666tvknpb3iaj0fg@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      a8b0ca17
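      In interface terms this is just a dropped parameter; the before/after
      shape of one affected entry point (signatures paraphrased to show the
      idea, not quoted from the patch):

        /* before: every caller had to say whether it ran in NMI context */
        void perf_sw_event(u32 event_id, u64 nr, int nmi,
                           struct pt_regs *regs, u64 addr);

        /* after: no nmi flag; contexts that cannot wake up directly
         * fall back to irq_work/self-IPI inside the perf core */
        void perf_sw_event(u32 event_id, u64 nr,
                           struct pt_regs *regs, u64 addr);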
  16. 19 Jun 2011, 1 commit
  17. 15 Jun 2011, 1 commit
  18. 29 May 2011, 1 commit
  19. 26 May 2011, 1 commit
  20. 25 May 2011, 3 commits
  21. 22 May 2011, 1 commit
  22. 21 May 2011, 1 commit
    • sanitize <linux/prefetch.h> usage · 268bb0ce
      Committed by Linus Torvalds
      Commit e66eed65 ("list: remove prefetching from regular list
      iterators") removed the include of prefetch.h from list.h, which
      uncovered several cases that had apparently relied on that rather
      obscure header file dependency.
      
      So this fixes things up a bit, using
      
         grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
         grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')
      
      to guide us in finding files that either need <linux/prefetch.h>
      inclusion, or have it despite not needing it.
      
      There are more of them around (mostly network drivers), but this gets
      many core ones.
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      268bb0ce
  23. 17 May 2011, 1 commit
  24. 13 May 2011, 2 commits
    • x86/mm: Fix section mismatch derived from native_pagetable_reserve() · 53f8023f
      Committed by Sedat Dilek
      With CONFIG_DEBUG_SECTION_MISMATCH=y I see these warnings in next-20110415:
      
        LD      vmlinux.o
        MODPOST vmlinux.o
      WARNING: vmlinux.o(.text+0x1ba48): Section mismatch in reference from the function native_pagetable_reserve() to the function .init.text:memblock_x86_reserve_range()
      The function native_pagetable_reserve() references
      the function __init memblock_x86_reserve_range().
      This is often because native_pagetable_reserve lacks a __init
      annotation or the annotation of memblock_x86_reserve_range is wrong.
      
      This patch fixes the issue.
      Thanks to pipacs from the PaX project for help on IRC.
      Acked-by: N"H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NSedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      53f8023f
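      Per the warning text above, the fix is the missing __init annotation;
      a sketch of the annotated function (illustrative):

        /* __init makes the reference to the __init function
         * memblock_x86_reserve_range() legitimate */
        void __init native_pagetable_reserve(u64 start, u64 end)
        {
                memblock_x86_reserve_range(start, end, "PGTABLE");
        }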
    • x86,xen: introduce x86_init.mapping.pagetable_reserve · 279b706b
      Committed by Stefano Stabellini
      Introduce a new x86_init hook called pagetable_reserve that is used at
      the end of init_memory_mapping to reserve the range of memory addresses
      used for the kernel pagetable pages and to free the other ones.
      
      On native it just calls memblock_x86_reserve_range, while on Xen it
      also takes care of flipping the spare memory previously allocated for
      kernel pagetable pages from RO to RW, so that it can be used for other
      purposes.
      
      A detailed explanation of the reason why this hook is needed follows.
      
      As a consequence of the commit:
      
      commit 4b239f45
      Author: Yinghai Lu <yinghai@kernel.org>
      Date:   Fri Dec 17 16:58:28 2010 -0800
      
          x86-64, mm: Put early page table high
      
      at some point init_memory_mapping is going to reach the pagetable pages
      area and map those pages too (mapping them as normal memory that falls
      within the range of addresses passed to init_memory_mapping as
      argument).  Some of those pages are already pagetable pages (they are
      in the range pgt_buf_start-pgt_buf_end), therefore they are going to be
      mapped RO and everything is fine.
      Some of these pages are not pagetable pages yet (they fall in the range
      pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end), so they
      are going to be mapped RW.  When these pages later become pagetable
      pages and are hooked into the pagetable, Xen will find that the guest
      already has an RW mapping of them somewhere and fail the operation.
      The reason Xen requires pagetables to be RO is that the hypervisor needs
      to verify that the pagetables are valid before using them. The validation
      operations are called "pinning" (more details in arch/x86/xen/mmu.c).
      
      In order to fix the issue we mark all the pages in the entire range
      pgt_buf_start-pgt_buf_top as RO; however, when the pagetable allocation
      is completed, only the range pgt_buf_start-pgt_buf_end is reserved by
      init_memory_mapping.  Hence the kernel is going to crash as soon as one
      of the pages in the range pgt_buf_end-pgt_buf_top is reused (because
      those pages are mapped RO).
      
      For this reason we need a hook to reserve the kernel pagetable pages we
      used and free the other ones so that they can be reused for other
      purposes.
      On native it just means calling memblock_x86_reserve_range; on Xen it
      also means marking RW the pagetable pages that we allocated before but
      that haven't actually been used.
      
      Another way to fix this without using the hook would be to add an 'if
      (xen_pv_domain())' check to the init_memory_mapping code and call the
      Xen counterpart directly, but that is just nasty.
      Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Acked-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      279b706b
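      A sketch of the hook's shape and its native implementation, condensed
      from the description above (illustrative; the Xen variant additionally
      flips the unused pgt_buf_end-pgt_buf_top pages back to RW):

        struct x86_init_mapping {
                /* reserve the pagetable pages actually used and free
                 * the spare ones so they can serve other purposes */
                void (*pagetable_reserve)(u64 start, u64 end);
        };

        void __init native_pagetable_reserve(u64 start, u64 end)
        {
                memblock_x86_reserve_range(start, end, "PGTABLE");
        }

        /* invoked at the end of init_memory_mapping: */
        x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
                                           PFN_PHYS(pgt_buf_end));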
  25. 02 May 2011, 5 commits
    • x86, NUMA: Trim numa meminfo with max_pfn in a separate loop · e5a10c1b
      Committed by Yinghai Lu
      While testing tj's 32bit NUMA unification code, I found one system with
      more than 64g that fails to use NUMA.  It turns out we do not trim the
      numa meminfo correctly against max_pfn when the start address of a node
      is higher than 64GiB.  The bug fix made it to the tip tree.
      
      This patch moves the checking and trimming into a separate loop, so we
      don't need to compare low/high in the following merge loops.  It makes
      the code more readable.
      
      It also makes the node merge printouts less strange.  On a 512GiB NUMA
      system with 32bit:
      
      before:
      > NUMA: Node 0 [0,a0000) + [100000,80000000) -> [0,80000000)
      > NUMA: Node 0 [0,80000000) + [100000000,1080000000) -> [0,1000000000)
      
      after:
      > NUMA: Node 0 [0,a0000) + [100000,80000000) -> [0,80000000)
      > NUMA: Node 0 [0,80000000) + [100000000,1000000000) -> [0,1000000000)
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      [Updated patch description and comment slightly.]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      e5a10c1b
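      The separate trim pass is simple to sketch (illustrative, following
      the description above):

        /* Clamp every block against [low, high) up front so the merge
         * loops below never have to compare low/high themselves. */
        for (i = 0; i < mi->nr_blks; i++) {
                struct numa_memblk *bi = &mi->blk[i];

                bi->start = max(bi->start, low);
                bi->end = min(bi->end, high);

                /* drop blocks that became empty after clamping */
                if (bi->start >= bi->end)
                        numa_remove_memblk_from(i--, mi);
        }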
    • x86, NUMA: Rename setup_node_bootmem() to setup_node_data() · a56bca80
      Committed by Yinghai Lu
      After memblock replaced bootmem, this function now only sets up
      node_data.
      
      Change the name to reflect what it actually does.
      
      tj: Minor adjustment to the patch description.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      a56bca80
    • x86, NUMA: Enable emulation on 32bit too · 1b7e03ef
      Committed by Tejun Heo
      Now that the NUMA init path is unified, NUMA emulation can be enabled
      on 32bit.  Make numa_emulation.c safe on 32bit by doing the following:
      
      * Define MAX_DMA32_PFN on 32bit too.
      
      * Include bootmem.h for max_pfn declaration.
      
      * Use u64 explicitly and always use PFN_PHYS() when converting page
        number to address.
      
      * Avoid __udivdi3() generation on 32bit by calculating the number of
        pages instead in split_nodes_interleave().
      
      Also drop the X86_64 dependency from Kconfig.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      1b7e03ef
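      The last two bullets come down to one 32bit pitfall: doing 64-bit
      arithmetic on byte quantities.  A sketch of the safe patterns (variable
      names are illustrative):

        /* widen via PFN_PHYS() so page numbers above 4GiB don't get
         * truncated when converted to physical addresses on 32bit */
        u64 addr = PFN_PHYS(pfn);

        /* divide page counts (unsigned long) rather than byte sizes
         * (u64), which would pull __udivdi3() into a 32bit build */
        unsigned long pages_per_node = total_pages / nr_nodes;
        u64 size_per_node = PFN_PHYS(pages_per_node);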
    • x86, NUMA: Enable CONFIG_AMD_NUMA on 32bit too · 2706a0bf
      Committed by Tejun Heo
      Now that the NUMA init path is unified, amdtopology can be enabled on
      32bit.  Make amdtopology.c safe on 32bit by explicitly using u64, and
      drop the X86_64 dependency from Kconfig.
      
      bootmem.h is included for the max_pfn declaration.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      2706a0bf
    • x86, NUMA: Rename amdtopology_64.c to amdtopology.c · c6f58878
      Committed by Tejun Heo
      amdtopology is going to be used by 32bit too; drop the _64 suffix.
      This is a pure rename.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      c6f58878