1. 27 August 2009, 7 commits
  2. 01 July 2009, 1 commit
    • x86: only clear node_states for 64bit · 66918dcd
      Committed by Yinghai Lu
      Nathan reported that
      
      | commit 73d60b7f
      | Author: Yinghai Lu <yinghai@kernel.org>
      | Date:   Tue Jun 16 15:33:00 2009 -0700
      |
      |    page-allocator: clear N_HIGH_MEMORY map before we set it again
      |
      |    SRAT tables may contains nodes of very small size.  The arch code may
      |    decide to not activate such a node.  However, currently the early boot
      |    code sets N_HIGH_MEMORY for such nodes.  These nodes therefore seem to be
      |    active although these nodes have no present pages.
      |
      |    For 64bit N_HIGH_MEMORY == N_NORMAL_MEMORY, so that works for 64 bit too
      
      unintentionally and incorrectly clears the cpuset.mems cgroup attribute on
      an i386 kvm guest, meaning that cpuset.mems cannot be used.
      
      Fix this by clearing node_states[N_NORMAL_MEMORY] for 64-bit only, and by
      saving/restoring that state in find_zone_movable_pfn.
      Reported-by: Nathan Lynch <ntl@pobox.com>
      Tested-by: Nathan Lynch <ntl@pobox.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
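      A minimal sketch of the idea above (not the actual patch; it assumes the
      kernel's nodemask helpers nodes_clear() and node_states[]):
      
      	#ifdef CONFIG_64BIT
      	/* On 64-bit, N_HIGH_MEMORY == N_NORMAL_MEMORY, so clearing the map
      	 * before rebuilding it is safe; 32-bit must leave it alone so the
      	 * cpuset.mems attribute keeps working. */
      	nodes_clear(node_states[N_NORMAL_MEMORY]);
      	#endif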
  3. 23 June 2009, 1 commit
    • x86: Move init_gbpages() to setup_arch() · 854c879f
      Committed by Pekka J Enberg
      The init_gbpages() function is conditionally called from
      init_memory_mapping() function. There are two call-sites where
      this 'after_bootmem' condition can be true: setup_arch() and
      mem_init() via pci_iommu_alloc().
      
      Therefore, it's safe to move the call to init_gbpages() to
      setup_arch() as it's always called before mem_init().
      
      This removes an after_bootmem use - paving the way to remove
      all uses of that state variable.
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <Pine.LNX.4.64.0906221731210.19474@melkki.cs.Helsinki.FI>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  4. 22 June 2009, 3 commits
    • x86: fix pageattr handling for lpage percpu allocator and re-enable it · e59a1bb2
      Committed by Tejun Heo
      The lpage allocator aliases a PMD page for each cpu and returns whatever
      is unused to the page allocator.  When the pageattr of the recycled
      pages is changed, the two aliases end up pointing at overlapping
      regions with different attributes, which isn't allowed and is known to
      cause subtle data corruption in certain cases.
      
      This can be handled in a similar manner to the x86_64 highmap alias:
      the pageattr code should detect whether the target pages have a PMD
      alias and, if so, split the PMD alias and synchronize the attributes.
      
      pcpur allocator is updated to keep the allocated PMD pages map sorted
      in ascending address order and provide pcpu_lpage_remapped() function
      which binary searches the array to determine whether the given address
      is aliased and if so to which address.  pageattr is updated to use
      pcpu_lpage_remapped() to detect the PMD alias and split it up as
      necessary from cpa_process_alias().
      
      Jan Beulich spotted the original problem and incorrect usage of vaddr
      instead of laddr for lookup.
      
      With this, lpage percpu allocator should work correctly.  Re-enable
      it.
      
      [ Impact: fix subtle lpage pageattr bug and re-enable lpage ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jan Beulich <JBeulich@novell.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
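      A hedged sketch of the sorted-array lookup described above (the struct,
      array and function names here are hypothetical; the real helper is
      pcpu_lpage_remapped() in the pcpur allocator):
      
      	/* Recycled PMD mappings, kept sorted by virtual address. */
      	struct lpage_map { unsigned long vaddr; unsigned long laddr; };
      	static struct lpage_map lpage_maps[NR_CPUS];
      	static int nr_lpage_maps;
      
      	/* Return the linear alias of @addr, or 0 if it is not remapped. */
      	static unsigned long lpage_remapped(unsigned long addr)
      	{
      		unsigned long base = addr & PMD_MASK;
      		int lo = 0, hi = nr_lpage_maps;
      
      		while (lo < hi) {
      			int mid = (lo + hi) / 2;
      
      			if (lpage_maps[mid].vaddr == base)
      				return lpage_maps[mid].laddr + (addr - base);
      			if (lpage_maps[mid].vaddr < base)
      				lo = mid + 1;
      			else
      				hi = mid;
      		}
      		return 0;
      	}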
    • x86: reorganize cpa_process_alias() · 992f4c1c
      Committed by Tejun Heo
      Reorganize cpa_process_alias() so that a new alias condition can be
      added easily.
      
      Jan Beulich spotted a problem in the original cleanup thread, which
      incorrectly assumed the two existing conditions were mutually
      exclusive.
      
      [ Impact: code reorganization ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
    • Move FAULT_FLAG_xyz into handle_mm_fault() callers · d06063cc
      Committed by Linus Torvalds
      This allows the callers to now pass down the full set of FAULT_FLAG_xyz
      flags to handle_mm_fault().  All callers have been (mechanically)
      converted to the new calling convention; there's almost certainly room
      for architectures to clean up their code and then add FAULT_FLAG_RETRY
      when that support is added.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
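      A minimal sketch of what a converted caller looks like (the flag value
      is the caller's choice; FAULT_FLAG_WRITE is the common case):
      
      	int fault;
      
      	/* The caller now decides the fault flags explicitly. */
      	fault = handle_mm_fault(mm, vma, address,
      				write ? FAULT_FLAG_WRITE : 0);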
  5. 21 June 2009, 1 commit
    • x86: don't use 'access_ok()' as a range check in get_user_pages_fast() · 7f818906
      Committed by Linus Torvalds
      It's really not right to use 'access_ok()', since that is meant for the
      normal "get_user()" and "copy_from/to_user()" accesses, which are done
      through the TLB, rather than through the page tables.
      
      Why? access_ok() does both too few, and too many checks.  Too many,
      because it is meant for regular kernel accesses that will not honor the
      'user' bit in the page tables, and because it honors the USER_DS vs
      KERNEL_DS distinction that we shouldn't care about in GUP.  And too few,
      because it doesn't do the 'canonical' check on the address on x86-64,
      since the TLB will do that for us.
      
      So instead of using a function that isn't meant for this, and does
      something else and much more complicated, just do the real rules: we
      don't want the range to overflow, and on x86-64, we want it to be a
      canonical low address (on 32-bit, all addresses are canonical).
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
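      A hedged sketch of the "real rules" described above (illustrative only;
      the constants follow the x86 headers of that era):
      
      	unsigned long start = addr & PAGE_MASK;
      	unsigned long end = start + ((unsigned long)nr_pages << PAGE_SHIFT);
      
      	/* Reject wrap-around of the range ... */
      	if (end < start)
      		goto slow;
      	#ifdef CONFIG_X86_64
      	/* ... and, on 64-bit, anything outside the canonical low range. */
      	if (end >> __VIRTUAL_MASK_SHIFT)
      		goto slow;
      	#endif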
  6. 19 June 2009, 1 commit
    • perf_counter, x86: Improve interactions with fast-gup · 0c871971
      Committed by Ingo Molnar
      Improve a few details in perfcounter call-chain recording that
      makes use of fast-GUP:
      
      - Use ACCESS_ONCE() to observe the pte value. ptes are fundamentally
        racy and can be changed on another CPU, so we have to be careful
        about how we access them. The PAE branch is already careful with
        read-barriers - but the non-PAE and 64-bit side needs an
        ACCESS_ONCE() to make sure the pte value is observed only once.
      
      - make the checks a bit stricter so that we can feed it any kind of
        cra^H^H^H user-space input ;-)
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
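      A small sketch of the ACCESS_ONCE() usage described above (the variable
      names are illustrative):
      
      	/* ptes can change under us on another CPU; read the value exactly
      	 * once and do all subsequent checks against this local copy. */
      	pte_t pte = ACCESS_ONCE(*ptep);
      
      	if (!pte_present(pte))
      		return 0;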
  7. 16 June 2009, 1 commit
    • x86: mm: Read cr2 before prefetching the mmap_lock · 5dfaf90f
      Committed by Ingo Molnar
      Prefetch instructions can generate spurious faults on certain
      models of older CPUs. The faults themselves cannot be stopped
      and they can occur pretty much anywhere - so the way we solve
      them is that we detect certain patterns and ignore the fault.
      
      There is one small path of code where we must not take faults
      though: the #PF handler execution leading up to the reading
      of the CR2 register (the faulting address). If we take a fault there,
      we overwrite the CR2 value with that of the prefetching
      instruction and possibly mishandle user-space or
      kernel-space pagefaults.
      
      It turns out that in current upstream we do exactly that:
      
      	prefetchw(&mm->mmap_sem);
      
      	/* Get the faulting address: */
      	address = read_cr2();
      
      This is not good.
      
      So turn the order around: first read the cr2, then prefetch
      the lock address. Reading cr2 is plenty fast (2 cycles) so
      delaying the prefetch by this amount shouldn't be a big issue
      performance-wise.
      
      [ And this might explain a mystery fault.c warning that sometimes
        occurs on an old AMD/Sempron based test-system I have -
        which does have such prefetch problems. ]
      
      Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      LKML-Reference: <20090616030522.GA22162@Krystal>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
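      The reordering described above, sketched for clarity (this mirrors the
      snippet quoted in the message, just with the two statements swapped):
      
      	/* Get the faulting address first, so a spurious prefetch fault
      	 * cannot clobber CR2 before we have read it: */
      	address = read_cr2();
      
      	prefetchw(&mm->mmap_sem);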
  8. 15 June 2009, 10 commits
  9. 13 June 2009, 2 commits
    • kmemcheck: include module.h to prevent warnings · 60e38393
      Committed by Randy Dunlap
      kmemcheck/shadow.c needs to include <linux/module.h> to prevent
      the following warnings:
      
      linux-next-20080724/arch/x86/mm/kmemcheck/shadow.c:64: warning: data definition has no type or storage class
      linux-next-20080724/arch/x86/mm/kmemcheck/shadow.c:64: warning: type defaults to 'int' in declaration of 'EXPORT_SYMBOL_GPL'
      linux-next-20080724/arch/x86/mm/kmemcheck/shadow.c:64: warning: parameter names (without types) in function declaration
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: vegardno@ifi.uio.no
      Cc: penberg@cs.helsinki.fi
      Cc: akpm <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
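      The warnings come from using EXPORT_SYMBOL_GPL() without its declaration
      in scope; a sketch of the fix (the exported symbol name here is
      illustrative, not taken from the patch):
      
      	#include <linux/module.h>	/* provides EXPORT_SYMBOL_GPL() */
      
      	/* ... */
      	EXPORT_SYMBOL_GPL(kmemcheck_mark_initialized);	/* illustrative symbol */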
    • kmemcheck: add the kmemcheck core · dfec072e
      Committed by Vegard Nossum
      General description: kmemcheck is a patch to the linux kernel that
      detects use of uninitialized memory. It does this by trapping every
      read and write to memory that was allocated dynamically (e.g. using
      kmalloc()). If a memory address is read that has not previously been
      written to, a message is printed to the kernel log.
      
      Thanks to Andi Kleen for the set_memory_4k() solution.
      
      Andrew Morton suggested documenting the shadow member of struct page.
      Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      
      [export kmemcheck_mark_initialized]
      [build fix for setup_max_cpus]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      
      [rebased for mainline inclusion]
      Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no>
  10. 12 June 2009, 2 commits
    • x86: change kernel_physical_mapping_init() __init to __meminit · 41d840e2
      Committed by Shaohua Li
      kernel_physical_mapping_init() can be called from the memory hotplug path.
      
      [ Impact: fix potential crash with memory hotplug ]
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      LKML-Reference: <20090612045752.GA827@sli10-desk.sh.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
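      A hedged sketch of the annotation change (the parameter list shown is an
      assumption based on the 64-bit version of that era; only the __init ->
      __meminit swap is the point):
      
      	/* __meminit keeps the function around after boot when memory
      	 * hotplug is enabled; __init would have its text freed. */
      	unsigned long __meminit
      	kernel_physical_mapping_init(unsigned long start, unsigned long end,
      				     unsigned long page_size_mask);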
    • x86: make zap_low_mapping could be used early · 55cd6367
      Committed by Yinghai Lu
      Only one cpu is up at that point, so just call __flush_tlb for it. This
      fixes the following boot warning on x86:
      
        [    0.000000] Memory: 885032k/915540k available (5993k kernel code, 29844k reserved, 3842k data, 428k init, 0k highmem)
        [    0.000000] virtual kernel memory layout:
        [    0.000000]     fixmap  : 0xffe17000 - 0xfffff000   (1952 kB)
        [    0.000000]     vmalloc : 0xf8615000 - 0xffe15000   ( 120 MB)
        [    0.000000]     lowmem  : 0xc0000000 - 0xf7e15000   ( 894 MB)
        [    0.000000]       .init : 0xc19a5000 - 0xc1a10000   ( 428 kB)
        [    0.000000]       .data : 0xc15da4bb - 0xc199af6c   (3842 kB)
        [    0.000000]       .text : 0xc1000000 - 0xc15da4bb   (5993 kB)
        [    0.000000] Checking if this processor honours the WP bit even in supervisor mode...Ok.
        [    0.000000] ------------[ cut here ]------------
        [    0.000000] WARNING: at kernel/smp.c:369 smp_call_function_many+0x50/0x1b0()
        [    0.000000] Hardware name: System Product Name
        [    0.000000] Modules linked in:
        [    0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip #52504
        [    0.000000] Call Trace:
        [    0.000000]  [<c104aa16>] warn_slowpath_common+0x65/0x95
        [    0.000000]  [<c104aa58>] warn_slowpath_null+0x12/0x15
        [    0.000000]  [<c1073bbe>] smp_call_function_many+0x50/0x1b0
        [    0.000000]  [<c1037615>] ? do_flush_tlb_all+0x0/0x41
        [    0.000000]  [<c1037615>] ? do_flush_tlb_all+0x0/0x41
        [    0.000000]  [<c1073d4f>] smp_call_function+0x31/0x58
        [    0.000000]  [<c1037615>] ? do_flush_tlb_all+0x0/0x41
        [    0.000000]  [<c104f635>] on_each_cpu+0x26/0x65
        [    0.000000]  [<c10374b5>] flush_tlb_all+0x19/0x1b
        [    0.000000]  [<c1032ab3>] zap_low_mappings+0x4d/0x56
        [    0.000000]  [<c15d64b5>] ? printk+0x14/0x17
        [    0.000000]  [<c19b42a8>] mem_init+0x23d/0x245
        [    0.000000]  [<c19a56a1>] start_kernel+0x17a/0x2d5
        [    0.000000]  [<c19a5347>] ? unknown_bootoption+0x0/0x19a
        [    0.000000]  [<c19a5039>] __init_begin+0x39/0x41
        [    0.000000] ---[ end trace 4eaa2a86a8e2da22 ]---
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
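      A hedged sketch of the idea (the actual patch plumbs this differently;
      the point is simply to avoid cross-CPU IPIs before SMP is up):
      
      	if (num_online_cpus() == 1)
      		__flush_tlb();		/* local flush; nobody to IPI yet */
      	else
      		flush_tlb_all();	/* uses smp_call_function() */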
  11. 11 June 2009, 2 commits
  12. 09 June 2009, 1 commit
  13. 29 May 2009, 1 commit
    • x86: ignore VM_LOCKED when determining if hugetlb-backed page tables can be shared or not · 32b154c0
      Committed by Mel Gorman
      Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13302
      
      On x86 and x86-64, it is possible that page tables are shared between
      shared mappings backed by hugetlbfs.  As part of this,
      page_table_shareable() checks a pair of vma->vm_flags and they must match
      if they are to be shared.  All VMA flags are taken into account, including
      VM_LOCKED.
      
      The problem is that VM_LOCKED is cleared on fork().  When a process with a
      shared memory segment forks() to exec() a helper, there will be shared
      VMAs with different flags.  The impact is that the shared segment is
      sometimes considered shareable and other times not, depending on what
      process is checking.
      
      What happens is that the segment page tables are being shared but the
      count is inaccurate depending on the ordering of events.  As the page
      tables are freed with put_page(), bad pmd's are found when some of the
      children exit.  The hugepage counters also get corrupted and the Total and
      Free count will no longer match even when all the hugepage-backed regions
      are freed.  This requires a reboot of the machine to "fix".
      
      This patch addresses the problem by comparing all flags except VM_LOCKED
      when deciding if pagetables should be shared or not for hugetlbfs-backed
      mappings.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <starlight@binnacle.cx>
      Cc: Eric B Munson <ebmunson@us.ibm.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
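      A small sketch of the comparison change (illustrative only; the real
      check lives in page_table_shareable()):
      
      	/* Ignore VM_LOCKED: fork() clears it in the child, so it must not
      	 * decide whether two hugetlbfs mappings may share page tables. */
      	unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED;
      	unsigned long svm_flags = svma->vm_flags & ~VM_LOCKED;
      
      	if (vm_flags != svm_flags)
      		return 0;	/* not shareable */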
  14. 27 May 2009, 1 commit
  15. 23 May 2009, 2 commits
  16. 18 May 2009, 2 commits
    • x86, mm: Fix node_possible_map logic · 7c43769a
      Committed by Yinghai Lu
      Recently there were some changes to the meaning of node_possible_map,
      and the result is quite strange:
      
      - a node without memory would be set in node_possible_map
      - but a node with less than NODE_MIN_SIZE of memory would be kicked out of node_possible_map.
      
      Fix it by adding strict_setup_node_bootmem().
      
      Also, remove unparse_node().
      
      So the result will be:
      
      1. cpu_to_node() will return an online node only (the nearest one)
      2. apicid_to_node() still returns a node that may not be online but is set
         in node_possible_map.
      3. node_possible_map will include nodes whose memory is less than NODE_MIN_SIZE
      
      v2: after the move_cpus_to_node change.
      
      [ Impact: get node_possible_map right ]
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Tested-by: Jack Steiner <steiner@sgi.com>
      LKML-Reference: <4A0C49BE.6080800@kernel.org>
      [ v3: various small cleanups and comment clarifications ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • mm, x86: remove MEMORY_HOTPLUG_RESERVE related code · 888a589f
      Committed by Yinghai Lu
      after:
      
       | commit b263295d
       | Author: Christoph Lameter <clameter@sgi.com>
       | Date:   Wed Jan 30 13:30:47 2008 +0100
       |
       |    x86: 64-bit, make sparsemem vmemmap the only memory model
      
      we don't have MEMORY_HOTPLUG_RESERVE anymore.
      
      Historically, x86-64 had an architecture-specific method for memory hotplug
      whereby it scanned the SRAT for physical memory ranges that could be
      potentially used for memory hot-add later. By reserving those ranges
      without physical memory, the memmap would be allocated and left dormant
      until needed. This depended on the DISCONTIG memory model which has been
      removed so the code implementing HOTPLUG_RESERVE is now dead.
      
      This patch removes the dead code used by MEMORY_HOTPLUG_RESERVE.
      
      (Changelog authored by Mel.)
      
      v2: updated changelog, and remove hotadd= in doc
      
      [ Impact: remove dead code ]
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: Mel Gorman <mel@csn.ul.ie>
      Workflow-found-OK-by: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <4A0C4910.7090508@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  17. 12 May 2009, 1 commit
  18. 11 May 2009, 1 commit