1. 08 Jan 2009, 1 commit
    • resource: allow MMIO exclusivity for device drivers · e8de1481
      Committed by Arjan van de Ven
      Device drivers that use pci_request_regions() (and similar APIs) have a
      reasonable expectation that they are the only ones accessing their device.
      As part of the e1000e hunt, we were afraid that some userland (X or some
      bootsplash stuff) was mapping the MMIO region that the driver thought it
      had exclusively via /dev/mem or via various sysfs resource mappings.
      
      This patch adds the option for device drivers to add their reserved
      regions to the "banned from /dev/mem use" list, so that now both kernel
      memory and device-exclusive MMIO regions are banned.
      NOTE: This is only active when CONFIG_STRICT_DEVMEM is set.
      NOTE: This is only active when CONFIG_STRICT_DEVMEM is set.
      
      In addition to the config option, a kernel parameter iomem=relaxed is
      provided for cases where developers want to diagnose driver issues from
      userspace in the field.
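
      As a driver-side illustration, the series also adds exclusive variants of
      the PCI region request helpers.  A minimal probe sketch, assuming a
      hypothetical "example" device (error handling abridged):

        #include <linux/pci.h>

        static int example_probe(struct pci_dev *pdev,
                                 const struct pci_device_id *id)
        {
                int err;

                err = pci_enable_device(pdev);
                if (err)
                        return err;

                /*
                 * Like pci_request_regions(), but additionally marks the
                 * BARs IORESOURCE_EXCLUSIVE, so with CONFIG_STRICT_DEVMEM=y
                 * userspace can no longer map them via /dev/mem or the
                 * sysfs resource files.
                 */
                err = pci_request_regions_exclusive(pdev, "example");
                if (err)
                        pci_disable_device(pdev);
                return err;
        }
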
      Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
      e8de1481
  2. 07 Jan 2009, 2 commits
    • mm: show node to memory section relationship with symlinks in sysfs · c04fc586
      Committed by Gary Hade
      Show node to memory section relationship with symlinks in sysfs
      
      Add /sys/devices/system/node/nodeX/memoryY symlinks for all
      the memory sections located on nodeX.  For example:
      /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
      indicates that memory section 135 resides on node1.
      
      Also revise the documentation to cover this change, and update
      Documentation/ABI/testing/sysfs-devices-memory to include descriptions
      of the memory hot-remove files 'phys_device', 'phys_index', and 'state'
      that were previously not described there.
      
      Beyond the general policy of providing users with as much physical
      location information as possible for resources that can be hot-added
      and/or hot-removed, the change provides the following user benefits
      (likely among others).
      Immediate:
        - Provides information needed to determine the specific node
          on which a defective DIMM is located.  This will reduce system
          downtime when the node or defective DIMM is swapped out.
        - Prevents unintended onlining of a memory section that was
          previously offlined due to a defective DIMM.  This could happen
          during node hot-add when the user or node hot-add assist script
          onlines _all_ offlined sections due to user or script inability
          to identify the specific memory sections located on the hot-added
          node.  The consequences of reintroducing the defective memory
          could be ugly.
        - Provides information needed to vary the amount and distribution
          of memory on specific nodes for testing or debugging purposes.
      Future:
        - Will provide information needed to identify the memory
          sections that need to be offlined prior to physical removal
          of a specific node.
      
      Symlink creation during boot was tested on 2-node x86_64, 2-node
      ppc64, and 2-node ia64 systems.  Symlink creation during physical
      memory hot-add was tested on a 2-node x86_64 system.
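
      As a rough sketch, each symlink can be created with the driver-core
      sysfs API; the helper and kobject names below are illustrative, not the
      patch's exact code:

        #include <linux/kernel.h>
        #include <linux/kobject.h>
        #include <linux/sysfs.h>

        /*
         * Create /sys/devices/system/node/nodeX/memoryY pointing at the
         * memoryY kobject.  'node_kobj' and 'mem_kobj' stand in for the
         * kobjects of the node and memory-section devices.
         */
        static int link_mem_section_to_node(struct kobject *node_kobj,
                                            struct kobject *mem_kobj,
                                            unsigned long section_nr)
        {
                char name[32];

                snprintf(name, sizeof(name), "memory%lu", section_nr);
                return sysfs_create_link(node_kobj, mem_kobj, name);
        }
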
      Signed-off-by: Gary Hade <garyhade@us.ibm.com>
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c04fc586
    • mm: invoke oom-killer from page fault · 1c0fe6e3
      Committed by Nick Piggin
      Rather than have the pagefault handler kill a process directly if it gets
      a VM_FAULT_OOM, have it call into the OOM killer.
      
      With increasingly sophisticated OOM behaviour (cpusets, memory cgroups,
      OOM-killing throttling, OOM priority adjustment or selective disabling,
      panic on OOM, etc.), it's silly to unconditionally kill the faulting
      process at page fault time.  Create a hook for the pagefault OOM path
      to call into instead.
      
      Only x86 and UML have been converted so far.
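
      Schematically, the converted arch fault path now looks like this (a
      simplified sketch, not the exact x86 code):

        /* On VM_FAULT_OOM, defer to the OOM killer instead of
         * unconditionally killing the faulting task. */
        static void mm_fault_error_sketch(unsigned int fault)
        {
                if (fault & VM_FAULT_OOM) {
                        /* New hook (mm/oom_kill.c): applies the usual OOM
                         * policy -- victim selection, cpuset/cgroup
                         * constraints, panic_on_oom -- rather than killing
                         * 'current' outright. */
                        pagefault_out_of_memory();
                } else if (fault & VM_FAULT_SIGBUS) {
                        /* deliver SIGBUS, as before */
                }
        }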
      
      [akpm@linux-foundation.org: make __out_of_memory() static]
      [akpm@linux-foundation.org: fix comment]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Jeff Dike <jdike@addtoit.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c0fe6e3
  3. 03 Jan 2009, 1 commit
  4. 02 Jan 2009, 1 commit
    • x86: convert permanent_kmaps_init() from macro to inline · a9067d53
      Committed by Ingo Brueckl
      Impact: cleanup
      
      This compiler warning:
      
        arch/x86/mm/init_32.c:515: warning: unused variable 'pgd_base'
      
      triggers because permanent_kmaps_init() is a preprocessor macro in the
      !CONFIG_HIGHMEM case, which does not tell the compiler that the
      'pgd_base' parameter is used.
      
      Convert permanent_kmaps_init() (and set_highmem_pages_init()) to
      C inline functions, which gives the parameter a proper type and
      gets rid of the compiler warning as well.
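
      The shape of the fix, roughly (the !CONFIG_HIGHMEM stub):

        /* Before: a macro, so the compiler never sees pgd_base used. */
        #define permanent_kmaps_init(pgd_base)  do { } while (0)

        /* After: an empty inline function; the parameter now has a real
         * type and counts as used, which silences the warning. */
        static inline void permanent_kmaps_init(pgd_t *pgd_base)
        {
        }
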
      Signed-off-by: Ingo Brueckl <ib@wupperonline.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      a9067d53
  5. 24 Dec 2008, 1 commit
    • x86: PAT: fix address types in track_pfn_vma_new() · c1c15b65
      Committed by H. Peter Anvin
      Impact: cleanup, fix warning
      
      This warning:
      
       arch/x86/mm/pat.c: In function track_pfn_vma_copy:
       arch/x86/mm/pat.c:701: warning: passing argument 5 of follow_phys from incompatible pointer type
      
      triggers because physical addresses are resource_size_t, not u64.
      
      This really matters when calling an interface like follow_phys(), which
      takes a pointer to a physical address -- although on x86, being
      little-endian, it would generally work anyway as long as the memory
      region wasn't completely uninitialized.
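
      The shape of the fix, as an illustrative fragment (variable names
      assumed):

        /* follow_phys() takes a resource_size_t *, which is not
         * interchangeable with a u64 * on configurations where physical
         * addresses are 32 bits wide. */
        resource_size_t paddr;  /* was: u64 paddr; */
        unsigned long prot;

        if (follow_phys(vma, vma->vm_start, 0, &prot, &paddr))
                return -EINVAL;
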
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c1c15b65
  6. 20 Dec 2008, 1 commit
  7. 19 Dec 2008, 2 commits
  8. 18 Dec 2008, 1 commit
  9. 17 Dec 2008, 3 commits
  10. 12 Dec 2008, 1 commit
  11. 14 Nov 2008, 1 commit
  12. 13 Nov 2008, 1 commit
    • x86, hibernate: fix breakage on x86_32 with CONFIG_NUMA set · 97a70e54
      Committed by Rafael J. Wysocki
      Impact: fix crash during hibernation on 32-bit NUMA
      
      The NUMA code on x86_32 creates a special memory mapping that allows
      each node's pgdat to be located in this node's memory.  For this
      purpose it allocates a memory area at the end of each node's memory
      and maps this area so that it is accessible with virtual addresses
      belonging to low memory.  As a result, if there is high memory,
      these NUMA-allocated areas are physically located in high memory,
      although they are mapped to low memory addresses.
      
      Our hibernation code does not take that into account and for this
      reason hibernation fails on all x86_32 systems with CONFIG_NUMA=y and
      with high memory present.  Fix this by adding a special mapping for
      the NUMA-allocated memory areas to the temporary page tables created
      during the last phase of resume.
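
      A heavily simplified sketch of the added mapping step; the real helper
      is resume_map_numa_kva() in arch/x86/power/hibernate_32.c, and the loop
      below is schematic rather than the exact code:

        /* For each node, re-create the low-memory alias for the node's
         * remapped pgdat area in the temporary resume page tables, using
         * large (PSE) pages. */
        static void resume_map_numa_kva_sketch(pgd_t *pgd_base)
        {
                int node;

                for_each_online_node(node) {
                        unsigned long start = (unsigned long)
                                node_remap_start_vaddr[node];
                        unsigned long pfn = node_remap_start_pfn[node];
                        unsigned long size = node_remap_size[node];
                        unsigned long off;

                        for (off = 0; off < size; off += PTRS_PER_PTE) {
                                unsigned long addr = start + (off << PAGE_SHIFT);
                                pgd_t *pgd = pgd_base + pgd_index(addr);
                                pud_t *pud = pud_offset(pgd, addr);
                                pmd_t *pmd = pmd_offset(pud, addr);

                                set_pmd(pmd, pfn_pmd(pfn + off,
                                                     PAGE_KERNEL_LARGE_EXEC));
                        }
                }
        }
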
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      97a70e54
  13. 06 Nov 2008, 1 commit
  14. 31 Oct 2008, 3 commits
  15. 29 Oct 2008, 4 commits
    • x86: remove debug code from arch_add_memory() · fe8b868e
      Committed by Gary Hade
      Impact: remove incorrect WARN_ON(1)
      
      Gets rid of dmesg spam created during physical memory hot-add which
      will very likely confuse users.  The change removes what appears to
      be debugging code which I assume was unintentionally included in:
      
        x86: arch/x86/mm/init_64.c printk fixes
        commit 10f22dde
      
      Signed-off-by: Gary Hade <garyhade@us.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      fe8b868e
    • x86: start annotating early ioremap pointers with __iomem · 1d6cf1fe
      Committed by Harvey Harrison
      Impact: some new sparse warnings in e820.c etc, but no functional change.
      
      As with regular ioremap, iounmap etc, annotate with __iomem.
      
      Fixes the following sparse warnings; it will produce some new ones
      elsewhere in arch/x86 that will get worked out over time.
      
      arch/x86/mm/ioremap.c:402:9: warning: cast removes address space of expression
      arch/x86/mm/ioremap.c:406:10: warning: cast adds address space to expression (<asn:2>)
      arch/x86/mm/ioremap.c:782:19: warning: Using plain integer as NULL pointer
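
      The annotation itself is just the sparse address-space attribute on the
      early-ioremap prototypes, matching the regular ioremap family (abridged):

        /* Before: plain pointers, so sparse cannot tell MMIO tokens from
         * ordinary kernel pointers. */
        void *early_ioremap(unsigned long phys_addr, unsigned long size);
        void early_iounmap(void *addr, unsigned long size);

        /* After: annotated with __iomem, like ioremap()/iounmap(). */
        void __iomem *early_ioremap(unsigned long phys_addr, unsigned long size);
        void early_iounmap(void __iomem *addr, unsigned long size);
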
      Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      1d6cf1fe
    • x86: two trivial sparse annotations · 9352f569
      Committed by Harvey Harrison
      Impact: fewer sparse warnings, no functional changes
      
      arch/x86/kernel/vsmp_64.c:87:14: warning: incorrect type in argument 1 (different address spaces)
      arch/x86/kernel/vsmp_64.c:87:14:    expected void const volatile [noderef] <asn:2>*addr
      arch/x86/kernel/vsmp_64.c:87:14:    got void *[assigned] address
      arch/x86/kernel/vsmp_64.c:88:22: warning: incorrect type in argument 1 (different address spaces)
      arch/x86/kernel/vsmp_64.c:88:22:    expected void const volatile [noderef] <asn:2>*addr
      arch/x86/kernel/vsmp_64.c:88:22:    got void *
      arch/x86/kernel/vsmp_64.c:100:23: warning: incorrect type in argument 2 (different address spaces)
      arch/x86/kernel/vsmp_64.c:100:23:    expected void volatile [noderef] <asn:2>*addr
      arch/x86/kernel/vsmp_64.c:100:23:    got void *
      arch/x86/kernel/vsmp_64.c:101:23: warning: incorrect type in argument 1 (different address spaces)
      arch/x86/kernel/vsmp_64.c:101:23:    expected void const volatile [noderef] <asn:2>*addr
      arch/x86/kernel/vsmp_64.c:101:23:    got void *
      arch/x86/mm/gup.c:235:6: warning: incorrect type in argument 1 (different base types)
      arch/x86/mm/gup.c:235:6:    expected void const volatile [noderef] <asn:1>*<noident>
      arch/x86/mm/gup.c:235:6:    got unsigned long [unsigned] [assigned] start
      Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9352f569
    • x86: fix init_memory_mapping for [dc000000 - e0000000) - v2 · f96f57d9
      Committed by Yinghai Lu
      Impact: change over-mapping to precise mapping, fix /proc/meminfo output
      
      v2: fix handling of systems with less than 1G of RAM
      
      When the GART aperture is 0xdc000000 - 0xe0000000, the old code
      returned a mapping of 0xc0000000 - 0xe0000000, which is not right.
      
      This patch fixes that, so we get an exact mapping.
      
      On a 256G system with that aperture, after the patch:
      LBSuse:~ # cat /proc/meminfo
      MemTotal:       264742432 kB
      MemFree:        263920628 kB
      Buffers:            1416 kB
      Cached:            24468 kB
      ...
      DirectMap4k:      5760 kB
      DirectMap2M:   3205120 kB
      DirectMap1G:  265289728 kB
      
      This is consistent with:
      LBSuse:~ # cat /sys/kernel/debug/kernel_page_tables
      ..
      ---[ Low Kernel Mapping ]---
      0xffff880000000000-0xffff880000200000           2M     RW             GLB x  pte
      0xffff880000200000-0xffff880040000000        1022M     RW         PSE GLB x  pmd
      0xffff880040000000-0xffff8800c0000000           2G     RW         PSE GLB NX pud
      0xffff8800c0000000-0xffff8800d7e00000         382M     RW         PSE GLB NX pmd
      0xffff8800d7e00000-0xffff8800d7fa0000        1664K     RW             GLB NX pte
      0xffff8800d7fa0000-0xffff8800d8000000         384K                           pte
      0xffff8800d8000000-0xffff8800dc000000          64M                           pmd
      0xffff8800dc000000-0xffff8800e0000000          64M     RW         PSE GLB NX pmd
      0xffff8800e0000000-0xffff880100000000         512M                           pmd
      0xffff880100000000-0xffff880800000000          28G     RW         PSE GLB NX pud
      0xffff880800000000-0xffff880824600000         582M     RW         PSE GLB NX pmd
      0xffff880824600000-0xffff8808247f0000        1984K     RW             GLB NX pte
      0xffff8808247f0000-0xffff880824800000          64K     RW     PCD     GLB NX pte
      0xffff880824800000-0xffff880840000000         440M     RW         PSE GLB NX pmd
      0xffff880840000000-0xffff884000000000         223G     RW         PSE GLB NX pud
      0xffff884000000000-0xffff884028000000         640M     RW         PSE GLB NX pmd
      0xffff884028000000-0xffff884040000000         384M                           pmd
      0xffff884040000000-0xffff888000000000         255G                           pud
      0xffff888000000000-0xffffc20000000000       58880G                           pgd
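
      The underlying technique is to split the requested range on large-page
      alignment boundaries and use big pages only for the naturally aligned
      middle, instead of rounding the whole range outward.  A standalone
      sketch of the idea (not the kernel's init_memory_mapping(), which also
      records the sub-ranges in an array before mapping them):

        #include <stdio.h>
        #include <stdint.h>

        #define SZ_2M (2ULL << 20)

        /* Map [start, end) exactly: 4K pages for any unaligned head and
         * tail, 2M pages for the aligned middle. */
        static void map_exact(uint64_t start, uint64_t end)
        {
                uint64_t head = (start + SZ_2M - 1) & ~(SZ_2M - 1);
                uint64_t tail = end & ~(SZ_2M - 1);

                if (head >= tail) {     /* no full 2M page fits */
                        printf("4K: %#llx-%#llx\n",
                               (unsigned long long)start,
                               (unsigned long long)end);
                        return;
                }
                if (start < head)
                        printf("4K: %#llx-%#llx\n",
                               (unsigned long long)start,
                               (unsigned long long)head);
                printf("2M: %#llx-%#llx\n",
                       (unsigned long long)head,
                       (unsigned long long)tail);
                if (tail < end)
                        printf("4K: %#llx-%#llx\n",
                               (unsigned long long)tail,
                               (unsigned long long)end);
        }

        int main(void)
        {
                /* The aperture case above: both ends are 2M-aligned, so
                 * exactly 0xdc000000-0xe0000000 is mapped and nothing
                 * outside it. */
                map_exact(0xdc000000ULL, 0xe0000000ULL);
                return 0;
        }
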
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f96f57d9
  16. 28 Oct 2008, 4 commits
  17. 27 Oct 2008, 1 commit
  18. 23 Oct 2008, 1 commit
  19. 22 Oct 2008, 2 commits
  20. 20 Oct 2008, 1 commit
    • mm: rewrite vmap layer · db64fe02
      Committed by Nick Piggin
      Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
      provide a fast, scalable percpu frontend for small vmaps (requires a
      slightly different API, though).
      
      The biggest problem with vmap is actually vunmap.  Presently this requires
      a global kernel TLB flush, which on most architectures is a broadcast IPI
      to all CPUs to flush their TLBs.  This is all done under a global lock.  As
      the number of CPUs increases, so will the number of vunmaps a scaled
      workload will want to perform, and so will the cost of a global TLB flush.
      This gives terrible quadratic scalability characteristics.
      
      Another problem is that the entire vmap subsystem works under a single
      lock.  It is a rwlock, but it is actually taken for write in all the fast
      paths, and the read locking would likely never be run concurrently anyway,
      so it's just pointless.
      
      This is a rewrite of vmap subsystem to solve those problems.  The existing
      vmalloc API is implemented on top of the rewritten subsystem.
      
      The TLB flushing problem is solved by using lazy TLB unmapping.  vmap
      addresses do not have to be flushed immediately when they are vunmapped,
      because the kernel will not reuse them again (would be a use-after-free)
      until they are reallocated.  So the addresses aren't allocated again until
      a subsequent TLB flush.  A single TLB flush then can flush multiple
      vunmaps from each CPU.
      
      XEN and PAT and such do not like deferred TLB flushing because they can't
      always handle multiple aliasing virtual addresses to a physical address.
      They now call vm_unmap_aliases() in order to flush any deferred mappings.
      That call is very expensive (well, actually not a lot more expensive than
      a single vunmap under the old scheme), however it should be OK if not
      called too often.
      
      The virtual memory extent information is stored in an rbtree rather than a
      linked list to improve the algorithmic scalability.
      
      There is a per-CPU allocator for small vmaps, which amortizes or avoids
      global locking.
      
      To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
      must be used in place of vmap and vunmap.  Vmalloc does not use these
      interfaces at the moment, so it will not be quite so scalable (although it
      will use lazy TLB flushing).
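
      The per-CPU interface, as introduced here, is used roughly like this (a
      sketch; the pgprot argument matches the signature added by this patch,
      and -1 means any NUMA node):

        #include <linux/mm.h>
        #include <linux/vmalloc.h>

        /* Map 'count' pages through the scalable per-CPU vmap frontend
         * instead of vmap()/vunmap(). */
        static void *map_buf(struct page **pages, unsigned int count)
        {
                return vm_map_ram(pages, count, -1, PAGE_KERNEL);
        }

        static void unmap_buf(void *addr, unsigned int count)
        {
                /* Lazy: the underlying TLB flush may be deferred and
                 * batched with other vunmaps. */
                vm_unmap_ram(addr, count);
        }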
      
      As a quick test of performance, I ran a test that loops in the kernel,
      linearly mapping, then touching, then unmapping 4 pages.  Different
      numbers of tests were run in parallel on a 4-core, 2-socket Opteron.
      Results are in nanoseconds per map+touch+unmap.
      
      threads           vanilla         vmap rewrite
      1                 14700           2900
      2                 33600           3000
      4                 49500           2800
      8                 70631           2900
      
      So with 8 cores, the rewritten version is already 25x faster.
      
      In a slightly more realistic test (although with an older and less
      scalable version of the patch), I ripped the not-very-good vunmap batching
      code out of XFS, and implemented the large buffer mapping with vm_map_ram
      and vm_unmap_ram...  along with a couple of other tricks, I was able to
      speed up a large directory workload by 20x on a 64 CPU system.  I believe
      vmap/vunmap is actually sped up a lot more than 20x on such a system, but
      I'm running into other locks now.  vmap is pretty well blown off the
      profiles.
      
      Before:
      1352059 total                                      0.1401
      798784 _write_lock                              8320.6667 <- vmlist_lock
      529313 default_idle                             1181.5022
       15242 smp_call_function                         15.8771  <- vmap tlb flushing
        2472 __get_vm_area_node                         1.9312  <- vmap
        1762 remove_vm_area                             4.5885  <- vunmap
         316 map_vm_area                                0.2297  <- vmap
         312 kfree                                      0.1950
         300 _spin_lock                                 3.1250
         252 sn_send_IPI_phys                           0.4375  <- tlb flushing
         238 vmap                                       0.8264  <- vmap
         216 find_lock_page                             0.5192
         196 find_next_bit                              0.3603
         136 sn2_send_IPI                               0.2024
         130 pio_phys_write_mmr                         2.0312
         118 unmap_kernel_range                         0.1229
      
      After:
       78406 total                                      0.0081
       40053 default_idle                              89.4040
       33576 ia64_spinlock_contention                 349.7500
        1650 _spin_lock                                17.1875
         319 __reg_op                                   0.5538
         281 _atomic_dec_and_lock                       1.0977
         153 mutex_unlock                               1.5938
         123 iget_locked                                0.1671
         117 xfs_dir_lookup                             0.1662
         117 dput                                       0.1406
         114 xfs_iget_core                              0.0268
          92 xfs_da_hashname                            0.1917
          75 d_alloc                                    0.0670
          68 vmap_page_range                            0.0462 <- vmap
          58 kmem_cache_alloc                           0.0604
          57 memset                                     0.0540
          52 rb_next                                    0.1625
          50 __copy_user                                0.0208
          49 bitmap_find_free_region                    0.2188 <- vmap
          46 ia64_sn_udelay                             0.1106
          45 find_inode_fast                            0.1406
          42 memcmp                                     0.2188
          42 finish_task_switch                         0.1094
          42 __d_lookup                                 0.0410
          40 radix_tree_lookup_slot                     0.1250
          37 _spin_unlock_irqrestore                    0.3854
          36 xfs_bmapi                                  0.0050
          36 kmem_cache_free                            0.0256
          35 xfs_vn_getattr                             0.0322
          34 radix_tree_lookup                          0.1062
          33 __link_path_walk                           0.0035
          31 xfs_da_do_buf                              0.0091
          30 _xfs_buf_find                              0.0204
          28 find_get_page                              0.0875
          27 xfs_iread                                  0.0241
          27 __strncpy_from_user                        0.2812
          26 _xfs_buf_initialize                        0.0406
          24 _xfs_buf_lookup_pages                      0.0179
          24 vunmap_page_range                          0.0250 <- vunmap
          23 find_lock_page                             0.0799
          22 vm_map_ram                                 0.0087 <- vmap
          20 kfree                                      0.0125
          19 put_page                                   0.0330
          18 __kmalloc                                  0.0176
          17 xfs_da_node_lookup_int                     0.0086
          17 _read_lock                                 0.0885
          17 page_waitqueue                             0.0664
      
      vmap has gone from being the top 5 on the profiles and flushing the crap
      out of all TLBs, to using less than 1% of kernel time.
      
      [akpm@linux-foundation.org: cleanups, section fix]
      [akpm@linux-foundation.org: fix build on alpha]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      db64fe02
  21. 18 Oct 2008, 1 commit
  22. 16 Oct 2008, 2 commits
  23. 14 Oct 2008, 4 commits