1. 23 Jul 2007, 3 commits
    • x86: Fix alternatives and kprobes to remap write-protected kernel text · 19d36ccd
      Andi Kleen authored
      Re-enable kprobes and alternative patching when the kernel text is
      write-protected by DEBUG_RODATA.
      
      Add a general utility function to change write protected text.  The new
      function remaps the code using vmap to write it and takes care of CPU
      synchronization.  It also does CLFLUSH to make icache recovery faster.
      
      There are some limitations on when the function can be used, see the
      comment.
      
      This is a newer version that also changes the paravirt_ops code.
      text_poke() also supports multi-byte patching now.
      
      Contains bug fixes from Zach Amsden and suggestions from Mathieu
      Desnoyers.
      
      Cc: Jan Beulich <jbeulich@novell.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>
      Cc: Zach Amsden <zach@vmware.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      19d36ccd
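
      A minimal sketch of the vmap-based idea, with a hypothetical helper
      name; the real text_poke() additionally does the CLFLUSH and the CPU
      synchronization described above:

        #include <linux/vmalloc.h>
        #include <linux/mm.h>
        #include <linux/string.h>
        #include <asm/processor.h>

        /* Assumes addr..addr+len-1 lies within one core-kernel text page. */
        static void patch_text_sketch(void *addr, const void *opcode, size_t len)
        {
                struct page *page = virt_to_page(addr);
                unsigned long off = (unsigned long)addr & ~PAGE_MASK;
                char *vaddr = vmap(&page, 1, VM_MAP, PAGE_KERNEL); /* writable alias */

                if (!vaddr)
                        return;
                memcpy(vaddr + off, opcode, len);   /* write through the alias */
                vunmap(vaddr);
                sync_core();    /* serialize before executing the patched code */
        }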
    • x86_64: Use read and write crX in .c files · f51c9452
      Glauber de Oliveira Costa authored
      This patch uses the read and write functions provided in system.h
      for control registers instead of writing raw assembly over and
      over again in .c files. Functions to manipulate cr2 and cr8 were
      provided, as they were lacking.
      
      Also removes some extra spaces after closing brackets.
      Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f51c9452
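
      A sketch of the accessor style this refers to (x86-64 flavor; the
      real helpers live in system.h and may differ in detail):

        static inline unsigned long read_cr2(void)
        {
                unsigned long cr2;
                asm volatile("movq %%cr2,%0" : "=r" (cr2));
                return cr2;
        }

        static inline void write_cr2(unsigned long val)
        {
                asm volatile("movq %0,%%cr2" : : "r" (val));
        }

      A .c file can then say address = read_cr2(); instead of repeating
      the raw asm at every call site.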
    • x86: i386-show-unhandled-signals-v3 · abd4f750
      Masoud Asgharifard Sharbiani authored
      This patch makes i386 behave the same way that x86_64 does when a
      segfault happens.  A line gets printed to the kernel log so that tools
      that need to check for failures can behave more uniformly across the
      two architectures.  The printout can be disabled by setting the
      debug.show_unhandled_signals sysctl variable to 0 (or by doing echo 0 >
      /proc/sys/debug/exception-trace).
      
      Also, all of the printed lines now go through printk_ratelimit() to
      prevent a local user from mounting a DoS against the kernel log with a
      program like the following:
      
       #include <unistd.h>

       int main(void)
       {
              while (1)                       /* fork endlessly; each child */
                      if (!fork())
                              *(int *)0 = 0;  /* segfaults via a NULL deref */
       }
      
      This new revision also includes Andrew's fix, which got rid of the
      new sysctl that earlier versions of this patch added.  Also, the
      'show-unhandled-signals' sysctl has been renamed back to the old
      'exception-trace' to avoid breaking people's scripts.
      
      AK: Enabling this by default for i386 will likely be controversial, but let's see what happens
      AK: Really folks, before complaining just fix your segfaults
      AK: I bet this will find a lot of silent issues
      Signed-off-by: Masoud Sharbiani <masouds@google.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      [ Personally, I've found the complaints useful on x86-64, so I'm all for
        this. That said, I wonder if we could do it more prettily..   -Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      abd4f750
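
      A sketch of the resulting fault-path report (format string and field
      names illustrative, using the i386 register names of the era):

        if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
            printk_ratelimit())
                printk("%s[%d]: segfault at %08lx eip %08lx "
                       "esp %08lx error %lx\n",
                       tsk->comm, tsk->pid, address,
                       regs->eip, regs->esp, error_code);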
  2. 22 Jul 2007, 11 commits
  3. 20 Jul 2007, 1 commit
    • mm: fault feedback #2 · 83c54070
      Nick Piggin authored
      This patch completes Linus's wish that the fault return codes be made into
      bit flags, which I agree makes everything nicer.  This requires all
      handle_mm_fault callers to be modified (possibly the modifications should
      go further and do things like fault accounting in handle_mm_fault --
      however, that would be for another patch).
      
      [akpm@linux-foundation.org: fix alpha build]
      [akpm@linux-foundation.org: fix s390 build]
      [akpm@linux-foundation.org: fix sparc build]
      [akpm@linux-foundation.org: fix sparc64 build]
      [akpm@linux-foundation.org: fix ia64 build]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ian Molton <spyro@f2s.com>
      Cc: Bryan Wu <bryan.wu@analog.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Richard Curnow <rc@rc0.org.uk>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
      Cc: Chris Zankel <chris@zankel.net>
      Acked-by: Kyle McMartin <kyle@mcmartin.ca>
      Acked-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
      Acked-by: Ralf Baechle <ralf@linux-mips.org>
      Acked-by: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      [ Still apparently needs some ARM and PPC loving - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83c54070
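
      A sketch of what a converted handle_mm_fault caller looks like under
      the new bit-flag scheme (flag names from this series):

        fault = handle_mm_fault(mm, vma, address, write);
        if (unlikely(fault & VM_FAULT_ERROR)) {
                if (fault & VM_FAULT_OOM)
                        goto out_of_memory;
                else if (fault & VM_FAULT_SIGBUS)
                        goto do_sigbus;
                BUG();
        }
        if (fault & VM_FAULT_MAJOR)     /* I/O was needed */
                tsk->maj_flt++;
        else
                tsk->min_flt++;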
  4. 22 Jun 2007, 1 commit
  5. 21 Jun 2007, 1 commit
    • x86: change_page_attr bandaids · 018d2ad0
      Andi Kleen authored
      - Disable CLFLUSH again; it is still broken. Always do WBINVD.
      - Always flush in the i386 case, not only when there are deferred pages.
      
      These are both brute-force inefficient fixes, to be improved
      next release cycle.
      
      The changes to i386 are a little more extensive than strictly
      needed (some dead code is added), but the code is now more similar to
      the x86-64 version, and the dead code will be used soon.
      Signed-off-by: Andi Kleen <ak@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      018d2ad0
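
      A sketch of the brute-force flush this implies (the four-argument
      on_each_cpu() matches the era; details may differ from the actual
      patch):

        static void flush_kernel_map(void *arg)
        {
                wbinvd();               /* write back + invalidate all caches */
                __flush_tlb_all();
        }

        /* at flush time, on every CPU: */
        on_each_cpu(flush_kernel_map, NULL, 1, 1);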
  6. 09 Jun 2007, 1 commit
    • fix sysrq-m oops · 12710a56
      Bob Picco authored
      We aren't sampling for holes in memory.  Thus we encounter a section hole
      with an empty section map pointer for SPARSEMEM, and show_mem oopses.  This
      issue has been seen in 2.6.21, current git, and current -mm.  The patch below
      is for mainline and -mm.  It was boot tested for SPARSEMEM, Andy's current
      VMEMMAP work from the -mm list, and DISCONTIGMEM.  A slightly different
      patch will be posted to stable for 2.6.21.
      
      Prior to commit f0a5a58a, memory_present
      was called for node_start_pfn through node_end_pfn.  This would cover the
      hole(s) with reserved pages and valid sections.  Most SPARSEMEM-supporting
      arches do a pfn_valid check in show_mem before computing the page structure
      address.
      
      This issue was brought to my attention on IRC by Arnaldo Carvalho de Melo.
      Thanks to Arnaldo for testing.
      Signed-off-by: Bob Picco <bob.picco@hp.com>
      Cc: Chuck Ebbert <cebbert@redhat.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      12710a56
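
      A sketch of the guarded walk: check pfn_valid() before touching the
      (possibly absent) mem_map entry.  Variable names illustrative:

        unsigned long pfn, reserved = 0, free = 0;
        struct page *page;

        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                if (!pfn_valid(pfn))    /* section hole: no struct page */
                        continue;
                page = pfn_to_page(pfn);
                if (PageReserved(page))
                        reserved++;
                else if (!page_count(page))
                        free++;
        }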
  7. 08 Jun 2007, 1 commit
    • enable interrupts in user path of page fault. · e5e3c84b
      Steven Rostedt authored
      This is a minor fix, but what is currently there is essentially wrong.
      In do_page_fault, if the faulting address from user code happens to be
      in kernel address space (e.g. int *p = (int *)-1; *p = 0xbed;) then the
      do_page_fault handler will jump over the local_irq_enable with the
      
        goto bad_area_nosemaphore;
      
      But the code there sees that this is user code and goes through the
      process of sending SIGSEGV to the user task.  This whole time
      interrupts are disabled and the task cannot be preempted by a
      higher-priority task.
      
      This patch always enables interrupts in the user path of
      bad_area_nosemaphore.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e5e3c84b
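
      A sketch of the fixed path (on x86, bit 2 of the page fault error
      code means the fault came from user mode):

        bad_area_nosemaphore:
                if (error_code & 4) {           /* user-mode access */
                        local_irq_enable();     /* the goto skipped this */
                        /* ... proceed to force SIGSEGV as before ... */
                }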
  8. 01 Jun 2007, 1 commit
  9. 09 May 2007, 3 commits
  10. 07 May 2007, 1 commit
    • Revert "[PATCH] x86: __pa and __pa_symbol address space separation" · e3ebadd9
      Linus Torvalds authored
      This was broken.  It adds complexity, for no good reason.  Rather than
      separate __pa() and __pa_symbol(), we should deprecate __pa_symbol(),
      and preferably __pa() too - and just use "virt_to_phys()" instead, which
      is more readable and has nicer semantics.
      
      However, right now, just undo the separation, and make __pa_symbol() be
      the exact same as __pa().  That fixes the bugs this patch introduced,
      and we can do the fairly obvious cleanups later.
      
      Do the new __phys_addr() function (which is now the actual workhorse for
      the unified __pa()/__pa_symbol()) as a real external function, that way
      all the potential issues with compile/link-time optimizations of
      constant symbol addresses go away, and we can also, if we choose to, add
      more sanity-checking of the argument.
      
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@in.ibm.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3ebadd9
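
      A sketch of the unified out-of-line workhorse (x86-64 layout;
      phys_base is the relocation offset of the kernel image):

        unsigned long __phys_addr(unsigned long x)
        {
                if (x >= __START_KERNEL_map)    /* kernel image mapping */
                        return x - __START_KERNEL_map + phys_base;
                return x - PAGE_OFFSET;         /* direct mapping */
        }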
  11. 03 May 2007, 13 commits
    • [PATCH] x86-64: Don't enable NUMA for a single node in K8 NUMA scanning · 3bea9c97
      Andi Kleen authored
      This was supposed to make the full memory visible on an ASUS A8SX
      motherboard with 4GB RAM where the northbridge reports less memory, but
      it didn't help there.  It's a reasonable change though, so let's include
      it anyway.
      Signed-off-by: Andi Kleen <ak@suse.de>
      3bea9c97
    • [PATCH] x86-64: set node_possible_map at runtime - try 2 · e3f1caee
      Suresh Siddha authored
      Set the node_possible_map at runtime on x86_64.  On a non NUMA system,
      num_possible_nodes() will now say '1'.
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      e3f1caee
    • [PATCH] x86-64: Inhibit machine from asserting an NMI when doing Alt-SysRq-M operation. · ae32b129
      Konrad Rzeszutek authored
      This patch touches the NMI watchdog every MAX_ORDER_NR_PAGES pages
      to keep the machine from triggering an NMI while the CPUs
      are locked.  This situation happens on boxes with more
      than 64 CPUs and 128GB of RAM when Alt-SysRq-m is performed.
      
      It has been successfully tested for regression on uni-, 2-, 4-, 8-,
      32-, and 64-CPU boxes with various memory configurations.
      Signed-off-by: Andi Kleen <ak@suse.de>
      ae32b129
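
      A sketch of the pattern (MAX_ORDER_NR_PAGES is a power of two, so
      the mask test fires once per block):

        unsigned long pfn;

        for (pfn = 0; pfn < end_pfn; pfn++) {
                if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0)
                        touch_nmi_watchdog();   /* keep the watchdog fed */
                if (!pfn_valid(pfn))
                        continue;
                /* ... tally the page as before ... */
        }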
    • [PATCH] x86-64: use lru instead of page->index and page->private for pgd lists management. · 2bff7383
      Christoph Lameter authored
      x86_64 currently simulates a list using the index and private fields of the
      page struct.  It seems the code was inherited from i386.  But x86_64 does
      not use the slab to allocate pgds and pmds etc., so the lru field is not
      used by the slab and is therefore available.
      
      This patch uses standard list operations on page->lru to realize pgd
      tracking.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2bff7383
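
      A sketch of the new linkage (locking is left to the callers, as in
      the existing pgd code):

        static inline void pgd_list_add(pgd_t *pgd)
        {
                struct page *page = virt_to_page(pgd);

                list_add(&page->lru, &pgd_list);
        }

        static inline void pgd_list_del(pgd_t *pgd)
        {
                list_del(&virt_to_page(pgd)->lru);
        }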
    • [PATCH] x86: tighten kernel image page access rights · 6fb14755
      Jan Beulich authored
      On x86-64, kernel memory freed after init can be entirely unmapped instead
      of just getting 'poisoned' by overwriting with a debug pattern.
      
      On i386 and x86-64 (under CONFIG_DEBUG_RODATA), kernel text and bug table
      can also be write-protected.
      
      Compared to the first version, this one prevents re-creating deleted
      mappings in the kernel image range on x86-64 if those were removed
      previously.  This, together with the original changes, prevents temporarily
      having inconsistent mappings when cacheability attributes are being
      changed on such pages (e.g. from AGP code).  While such duplicate
      mappings don't exist on i386, the same change is made there, too, both for
      consistency and because checking pte_present() before using various other
      pte_XXX functions is a requirement anyway.  At the same time, the i386
      code is adjusted to use pte_huge() instead of open-coding it.
      
      AK: split out cpa() changes
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      6fb14755
    • [PATCH] x86: Improve handling of kernel mappings in change_page_attr · d01ad8dd
      Jan Beulich authored
      Fix various broken corner cases in i386 and x86-64 change_page_attr.
      
      AK: split off from tighten kernel image access rights
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      d01ad8dd
    • [PATCH] x86-64: fixed size remaining fake nodes · 382591d5
      David Rientjes authored
      Extends the numa=fake x86_64 command-line option to split the remaining system
      memory into nodes of fixed size.  Any leftover memory is allocated to a final
      node unless the command-line ends with a comma.
      
      For example:
        numa=fake=2*512,*128	gives two 512M nodes and the remaining system
      			memory is split into nodes of 128M each.
      
      This is beneficial for systems where the exact size of RAM is unknown or not
      necessarily relevant, but the size of the remaining nodes to be allocated is
      known based on their capacity for resource management.
      
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      382591d5
    • [PATCH] x86-64: split remaining fake nodes equally · 14694d73
      David Rientjes authored
      Extends the numa=fake x86_64 command-line option to split the remaining
      system memory into equal-sized nodes.
      
      For example:
      numa=fake=2*512,4*	gives two 512M nodes and the remaining system
      			memory is split into four approximately equal
      			chunks.
      
      This is beneficial for systems where the exact size of RAM is unknown or not
      necessarily relevant, but the granularity with which nodes shall be allocated
      is known.
      
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      14694d73
    • [PATCH] x86-64: configurable fake numa node sizes · 8b8ca80e
      David Rientjes authored
      Extends the numa=fake x86_64 command-line option to allow for configurable
      node sizes.  These nodes can be used in conjunction with cpusets for coarse
      memory resource management.
      
      The old command-line option is still supported:
        numa=fake=32	gives 32 fake NUMA nodes, ignoring the NUMA setup of the
      		actual machine.
      
      But now you may configure your system for the node sizes of your choice:
        numa=fake=2*512,1024,2*256
      		gives two 512M nodes, one 1024M node, two 256M nodes, and
      		the rest of system memory to a sixth node.
      
      The existing hash function is maintained to support the various node sizes
      that are possible with this implementation.
      
      Each node of the same size receives roughly the same number of available
      pages, regardless of any reserved memory within its address range.  The
      total number of available pages on the system is calculated and divided by
      the number of equal nodes to allocate.  These nodes are then dynamically
      allocated and their borders extended until their number of available pages
      reaches the required size.
      
      Configurable node sizes are recommended when used in conjunction with
      cpusets for memory control because they eliminate the overhead associated
      with scanning the zonelists of many smaller full nodes in page_alloc().
      
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8b8ca80e
    • [PATCH] x86: __pa and __pa_symbol address space separation · 0dbf7028
      Vivek Goyal authored
      Currently __pa_symbol is for use with symbols in the kernel address
      map and __pa is for use with pointers into the physical memory map.
      But the code is implemented so you can usually interchange the two.
      
      __pa, which is much more common, can be implemented much more cheaply
      if it doesn't have to worry about any other kernel address
      spaces.  This is especially true with a relocatable kernel, as
      __pa_symbol needs to perform an extra variable read to resolve
      the address.
      
      There is a third macro added for the vsyscall data, __pa_vsymbol,
      for finding the physical addresses of vsyscall pages.
      
      Most of this patch is simply sorting through the references to
      __pa or __pa_symbol and using the proper one.  A little of
      it is continuing to use a physical address when we have it
      instead of recalculating it several times.
      
      swapper_pgd is now NULL.  leave_mm now uses init_mm.pgd
      and init_mm.pgd is initialized at boot (instead of compile time)
      to the physmem virtual mapping of init_level4_pgd.  The
      physical address changed.
      
      Except for the EMPTY_ZERO page, all of the remaining references
      to __pa_symbol appear to be during kernel initialization.  So this
      should reduce the cost of __pa in the common case, even on a relocated
      kernel.
      
      As this is technically a semantic change we need to be on the lookout
      for anything I missed.  But it works for me (tm).
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      0dbf7028
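
      A simplified sketch of the intended split on x86-64 (the real
      __pa_symbol wraps the symbol in asm to defeat compile-time folding;
      constants from the era's memory layout):

        /* pointers into the direct physical-memory map */
        #define __pa(x)        ((unsigned long)(x) - PAGE_OFFSET)
        /* symbols in the kernel image mapping */
        #define __pa_symbol(x) \
                ((unsigned long)(x) - __START_KERNEL_map + phys_base)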
    • [PATCH] x86-64: Remove the identity mapping as early as possible · cfd243d4
      Vivek Goyal authored
      With the rewrite of the SMP trampoline and the early page
      allocator there is nothing that needs identity mapped pages,
      once we start executing C code.
      
      So add zap_identity_mappings into head64.c and remove
      zap_low_mappings() from much later in the code.  The functions
      are subtly different, thus the name change.
      
      This also kills boot_level4_pgt which was from an earlier
      attempt to move the identity mappings as early as possible,
      and is now no longer needed.  Essentially I have replaced
      boot_level4_pgt with trampoline_level4_pgt in trampoline.S
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      cfd243d4
    • [PATCH] x86-64: Kill temp boot pmds · dafe41ee
      Vivek Goyal authored
      Early in the boot process we need the ability to set
      up temporary mappings, before our normal mechanisms are
      initialized.  Currently this is used to map pages that
      are part of the page tables we are building and pages
      during the dmi scan.
      
      The core problem is that we are using the user portion of
      the page tables to implement this.  Which means that while
      this mechanism is active we cannot catch NULL pointer dereferences
      and we deviate from the normal ways of handling things.
      
      In this patch I modify early_ioremap to map pages into
      the kernel portion of address space, roughly where
      we will later put modules, and I make the discovery of
      which addresses we can use dynamic, which removes all
      kinds of static limits and removes the dependencies
      on implementation details between different parts of the code.
      
      Now alloc_low_page() and unmap_low_page() use
      early_ioremap() and early_iounmap() to allocate/map and
      unmap a page.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      dafe41ee
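
      A sketch of the resulting pattern (names from the commit message;
      details of the real implementation differ):

        static void *alloc_low_page(unsigned long *phys)
        {
                unsigned long pfn = table_end++;    /* next free boot page */
                void *adr;

                *phys = pfn << PAGE_SHIFT;
                adr = early_ioremap(*phys, PAGE_SIZE);  /* temp kernel mapping */
                memset(adr, 0, PAGE_SIZE);
                return adr;
        }

        static void unmap_low_page(void *adr)
        {
                early_iounmap(adr, PAGE_SIZE);
        }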
    • [PATCH] x86-64: dma_ops as const · e6584504
      Stephen Hemminger authored
      The dma_ops structure can be const since it never changes
      after boot.
      Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@suse.de>
      e6584504
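
      A sketch of the effect (struct and member names illustrative): a
      table that never changes after boot can be const and land in
      .rodata, and the global pointer becomes pointer-to-const:

        static const struct dma_mapping_ops nommu_dma_ops = {
                .map_single     = nommu_map_single,
                .unmap_single   = nommu_unmap_single,
        };

        extern const struct dma_mapping_ops *dma_ops;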
  12. 24 Apr 2007, 1 commit
    • [PATCH] x86-64: Always flush all pages in change_page_attr · 90767bd1
      Andi Kleen authored
      change_page_attr on x86-64 only flushed the TLB for pages that got
      reverted. That's not correct: it has to be flushed in all cases.
      
      This bug was added in some earlier changes.
      
      Just flush all pages for now.
      
      This could be done more efficiently, but this late in the release
      cycle it seems to be the best fix.
      
      Pointed out by Jan Beulich.
      Signed-off-by: Andi Kleen <ak@suse.de>
      90767bd1
  13. 15 Feb 2007, 2 commits