1. 22 Jan, 2009 1 commit
  2. 20 Jan, 2009 1 commit
    • x86: optimise x86's do_page_fault (C entry point for the page fault path) · 92181f19
      Nick Piggin authored
      Impact: cleanup, restructure code to improve assembly
      
      gcc isn't _all_ that smart about spilling registers to stack or reusing
      stack slots, even with branch annotations. do_page_fault contained a lot
      of functionality, so split unlikely paths into their own functions, and
      mark them as noinline just to be sure. I consider this actually to be
      somewhat of a cleanup too: the main function now contains about half
      the number of lines so the normal path is easier to read, while the error
      cases are also nicely split away.
      
      Also, ensure the order of arguments to functions is always the same: regs,
      addr, error_code. This can reduce code size a tiny bit, and just looks neater
      too.
      
      And add a couple of branch annotations.
      
      Before:
        do_page_fault:
                subq    $360, %rsp      #,
      
      After:
        do_page_fault:
                subq    $56, %rsp       #,
      
      bloat-o-meter:
        add/remove: 8/0 grow/shrink: 0/1 up/down: 2222/-1680 (542)
        function                                     old     new   delta
        __bad_area_nosemaphore                         -     506    +506
        no_context                                     -     474    +474
        vmalloc_fault                                  -     424    +424
        spurious_fault                                 -     358    +358
        mm_fault_error                                 -     272    +272
        bad_area_access_error                          -      89     +89
        bad_area                                       -      89     +89
        bad_area_nosemaphore                           -      10     +10
        do_page_fault                               2464     784   -1680
      
      Yes, the total size increases by 542 bytes, due to the extra function calls.
      But these will very rarely be called (except for vmalloc_fault) in a normal
      workload. Importantly, do_page_fault is less than 1/3rd its original size,
      and touches far less stack.
      
      Existing gotos and branch hints did move a lot of the infrequently used text
      out of the fastpath, but that's even further improved after this patch.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
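The splitting technique described in this commit can be sketched in user-space C. This is only an illustration, not the kernel code: `regs_sketch`, `bad_area_sketch` and `fault_sketch` are hypothetical names, and the `noinline` attribute and `__builtin_expect` assume GCC or Clang. The rare path keeps its large local buffer in its own frame, so the common path needs very little stack.

```c
#include <stddef.h>

/* Branch hint in the style of the kernel's unlikely() macro. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for the saved register state. */
struct regs_sketch {
	unsigned long ip;
};

/*
 * Cold path, forced out of line: its large local buffer lives in this
 * function's own frame and does not enlarge the caller's frame.
 */
__attribute__((noinline)) int
bad_area_sketch(struct regs_sketch *regs, unsigned long addr,
		unsigned long error_code)
{
	volatile unsigned char report[256];
	size_t i;

	for (i = 0; i < sizeof(report); i++)
		report[i] = (unsigned char)(regs->ip + addr + error_code + i);
	return -1;
}

/* Hot path: same argument order (regs, addr, error_code) as the callee. */
int fault_sketch(struct regs_sketch *regs, unsigned long addr,
		 unsigned long error_code)
{
	if (unlikely(addr == 0))
		return bad_area_sketch(regs, addr, error_code);
	return 0;
}
```

Keeping the argument order identical in caller and callee means the compiler can usually leave the values in the same argument registers when making the call.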
  3. 13 Jan, 2009 1 commit
    • x86: avoid theoretical vmalloc fault loop · f313e123
      Andi Kleen authored
      Ajith Kumar noticed:
      
       I was going through the vmalloc fault handling for x86_64 and am unclear
       about the following lines in the vmalloc_fault() function.
      
       pgd = pgd_offset(current->mm ?: &init_mm, address);
       pgd_ref = pgd_offset_k(address);
      
       Here the intention is to get the pgd corresponding to the current process
       and sync it up with the pgd in init_mm (obtained from pgd_offset_k).
       However, for kernel threads current->mm is NULL and hence pgd =
       pgd_offset(init_mm, address) = pgd_ref which means the fault handler
       returns without setting the pgd entry in the MM structure in the context
       of which the kernel thread has faulted.  This could lead to never-ending
       faults and busy looping of kernel threads like pdflush.  So, shouldn't the
       pgd = pgd_offset(current->mm ?: &init_mm, address); be pgd =
       pgd_offset(current->active_mm ?: &init_mm, address);
      
      We can use active_mm unconditionally because it should be always set.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
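The problem Ajith Kumar describes can be modelled with a few mock structures. This is a simplified sketch, not the kernel's page table code: `mm_sketch`, `task_sketch`, `init_mm_sketch` and the helper functions are hypothetical. With the old selection, a kernel thread (whose `mm` is NULL) ends up comparing the reference table with itself, so nothing is ever copied into the tables it actually runs on.

```c
#include <stddef.h>

#define PGD_ENTRIES_SKETCH 4

/* Hypothetical, heavily simplified address space and task. */
struct mm_sketch {
	unsigned long pgd[PGD_ENTRIES_SKETCH];
};

struct task_sketch {
	struct mm_sketch *mm;		/* NULL for kernel threads */
	struct mm_sketch *active_mm;	/* the tables actually in use */
};

/* Stand-in for the kernel's reference page tables. */
struct mm_sketch init_mm_sketch;

/* Old selection: falls back to the reference tables when mm is NULL. */
unsigned long *pgd_old_sketch(const struct task_sketch *t, size_t idx)
{
	struct mm_sketch *mm = t->mm ? t->mm : &init_mm_sketch;

	return &mm->pgd[idx];
}

/* New selection: always use the tables the task is running on. */
unsigned long *pgd_new_sketch(const struct task_sketch *t, size_t idx)
{
	return &t->active_mm->pgd[idx];
}

unsigned long *pgd_ref_sketch(size_t idx)
{
	return &init_mm_sketch.pgd[idx];
}

/* Copy the reference entry if needed; returns 1 if something changed. */
int sync_pgd_sketch(unsigned long *pgd, const unsigned long *pgd_ref)
{
	if (pgd == pgd_ref || *pgd == *pgd_ref)
		return 0;
	*pgd = *pgd_ref;
	return 1;
}
```

For a task with `mm == NULL`, `pgd_old_sketch()` returns the same pointer as `pgd_ref_sketch()`, so `sync_pgd_sketch()` is a no-op and the borrowed tables stay unsynchronised. `pgd_new_sketch()` reaches the borrowed tables and the entry is copied.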
  4. 07 Jan, 2009 1 commit
    • mm: invoke oom-killer from page fault · 1c0fe6e3
      Nick Piggin authored
      Rather than have the pagefault handler kill a process directly if it gets
      a VM_FAULT_OOM, have it call into the OOM killer.
      
      With increasingly sophisticated oom behaviour (cpusets, memory cgroups,
      oom killing throttling, oom priority adjustment or selective disabling,
      panic on oom, etc), it's silly to unconditionally kill the faulting
      process at page fault time.  Create a hook for pagefault oom path to call
      into instead.
      
      Only converted x86 and uml so far.
      
      [akpm@linux-foundation.org: make __out_of_memory() static]
      [akpm@linux-foundation.org: fix comment]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Jeff Dike <jdike@addtoit.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 14 Nov, 2008 1 commit
  6. 27 Oct, 2008 1 commit
  7. 22 Oct, 2008 1 commit
  8. 14 Oct, 2008 1 commit
  9. 13 Oct, 2008 2 commits
    • x86/mm: do not trigger a kernel warning if user-space disables interrupts and... · 891cffbd
      Linus Torvalds authored
      x86/mm: do not trigger a kernel warning if user-space disables interrupts and generates a page fault
      
      Arjan reported a spike in the following bug pattern in v2.6.27:
      
         http://www.kerneloops.org/searchweek.php?search=lock_page
      
      which happens because hwclock started triggering warnings due to
      a (correct) might_sleep() check in the MM code.
      
      The warning occurs because hwclock uses this dubious sequence of
      code to run "atomic" code:
      
        static unsigned long
        atomic(const char *name, unsigned long (*op)(unsigned long),
               unsigned long arg)
        {
          unsigned long v;
          __asm__ volatile ("cli");
          v = (*op)(arg);
          __asm__ volatile ("sti");
          return v;
        }
      
      Then it pagefaults in that "atomic" section, triggering the warning.
      
      There is no way the kernel could provide "atomicity" in this path:
      a page fault is a cannot-continue machine event, so the kernel has to
      wait for the page to be filled in.
      
      Even if it was just a minor fault we'd have to take locks and might have
      to spend quite a bit of time with interrupts disabled - not nice to irq
      latencies in general.
      
      So instead just enable interrupts in the pagefault path unconditionally
      if we come from user-space, and handle the fault.
      
      Also, while touching this code, unify some trivial parts of the x86
      VM paths at the same time.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reported-by: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
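The decision this commit describes can be sketched as a small predicate. This is a simplified model, not the kernel code: `fault_regs_sketch` and `enable_irqs_on_fault_sketch` are hypothetical names. The user-space branch follows the commit message; the kernel-mode branch (re-enable only if the interrupted context had interrupts on) is an assumption added for contrast.

```c
/* Interrupt-enable flag, bit 9 of the x86 EFLAGS register. */
#define EFLAGS_IF_SKETCH 0x200UL

/* Hypothetical, simplified view of the state saved at fault time. */
struct fault_regs_sketch {
	unsigned long flags;	/* saved EFLAGS */
	int from_user;		/* nonzero if the fault came from user space */
};

/* Returns 1 if the handler should enable interrupts before proceeding. */
int enable_irqs_on_fault_sketch(const struct fault_regs_sketch *regs)
{
	/* User space: enable unconditionally, even after a "cli". */
	if (regs->from_user)
		return 1;

	/* Kernel mode (assumption): respect the interrupted context. */
	return (regs->flags & EFLAGS_IF_SKETCH) != 0;
}
```

With this rule, a user program that executed `cli` and then faults is handled with interrupts enabled, so the sleeping paths in the MM code no longer trip the might_sleep() check.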
    • traps: x86: remove trace_hardirqs_fixup from pagefault handler · 69c89b5b
      Alexander van Heukelum authored
      The last use of trace_hardirqs_fixup is unnecessary, because the
      trap is taken with interrupt off on i386 as well as x86_64, and
      the irq-tracer is notified of this from the assembly code.
      
      trace_hardirqs_fixup and trace_hardirqs_fixup_flags are removed
      from include/asm-x86/irqflags.h as they are no longer used.
      Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  10. 07 Sep, 2008 3 commits
  11. 23 Jul, 2008 1 commit
  12. 08 Jul, 2008 1 commit
    • x86: simplify vmalloc_sync_all · 67350a5c
      Jeremy Fitzhardinge authored
      vmalloc_sync_all() is only called from register_die_notifier and
      alloc_vm_area.  Neither is on any performance-critical paths, so
      vmalloc_sync_all() itself is not on any hot paths.
      
      Given that the optimisations in vmalloc_sync_all add a fair amount of
      code and complexity, and are fairly hard to evaluate for correctness,
      it's better to just remove them to simplify the code rather than worry
      about its absolute performance.
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: xen-devel <xen-devel@lists.xensource.com>
      Cc: Stephen Tweedie <sct@redhat.com>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Mark McLoughlin <markmc@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  13. 03 Jul, 2008 1 commit
  14. 01 Jul, 2008 1 commit
  15. 13 Jun, 2008 1 commit
    • x86: fix endless page faults in mount_block_root for Linux 2.6 · b29c701d
      Henry Nestler authored
      Page faults in the kernel address space between PAGE_OFFSET and
      VMALLOC_START should not be handled as vmalloc faults.
      
      Fix rare endless page faults inside mount_block_root when mounting the
      root filesystem at boot time.
      
      All 32-bit kernels up to 2.6.25 can fall into this hole.
      I cannot reproduce it under a native Linux kernel. The 64-bit code has
      already fixed the problem, so I copied the same lines into the 32-bit part.
      
      The recorded debug output is from a coLinux kernel 2.6.22.18 (virtualisation):
      http://www.henrynestler.com/colinux/testing/pfn-check-0.7.3/20080410-antinx/bug16-recursive-page-fault-endless.txt
      The physical memory was trimmed down to 192MB to catch the bug more easily.
      With more memory the bug occurs more rarely.
      
      Details of how every 32-bit x86 system can fail:
      
      Start from "mount_block_root":
      http://lxr.linux.no/linux/init/do_mounts.c#L297
      There the variable "fs_names" gets one memory page of 4096 bytes.
      The variable "p" walks through the existing file system types. The first
      string is no problem.
      But on the second iteration of the loop in mount_block_root, "p" no longer
      points at the beginning of the page; the offset is, for example, +9 if
      "reiserfs" is the first in the list.
      It then calls do_mount_root and lands in sys_mount.
      Remember: the variable "type_page" now contains "fs_type+9" and does not
      cover a full page.
      sys_mount copies 4096 bytes with the function "exact_copy_from_user()":
      http://lxr.linux.no/linux/fs/namespace.c#L1540
      
      Usually the pages after the buffer "fs_names+4096+9" exist, and the page
      fault handler is not called. No problem.
      
      If the page after "fs_names+4096" is not mapped, the page fault handler
      is called from http://lxr.linux.no/linux/fs/namespace.c#L1320
      
      do_page_fault gets the address 0xc03b4000.
      It is a kernel address (address >= TASK_SIZE), but not from vmalloc! It
      comes from "__getname()", alias "kmem_cache_alloc".
      The "error_code" is 0, so "vmalloc_fault" is called:
      http://lxr.linux.no/linux/arch/i386/mm/fault.c#L332
      
      "vmalloc_fault" tries to find the physical page for a non-existent
      virtual memory area. The macro "pte_present" in vmalloc_fault()
      triggers another page fault, for 0xc0000ed0, at:
      http://lxr.linux.no/linux/arch/i386/mm/fault.c#L282
      
      No PTE exists for that virtual address. The page fault handler tries
      to sync the physical page for the PTE lookup.
      
      This calls vmalloc_fault() again for address 0xc000000, which does not
      exist either. The endless loop begins...
      
      Normally the CPU would keep looping with interrupts disabled. Under
      coLinux this was caught by a stack overflow inside the printk debug output.
      Signed-off-by: Henry Nestler <henry.nestler@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
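Two pieces of this explanation lend themselves to a short sketch: the 4096-byte copy that starts 9 bytes into a page and therefore reaches into the next page, and the range check that stops non-vmalloc kernel addresses from being treated as vmalloc faults. The code below is an illustration under assumptions, not the patch itself: the function names are hypothetical and the `VMALLOC_*_SKETCH` bounds are made-up example values, since the real bounds depend on the kernel configuration.

```c
#define PAGE_SIZE_SKETCH     4096UL

/* Made-up example bounds; the real values are configuration dependent. */
#define VMALLOC_START_SKETCH 0xf8000000UL
#define VMALLOC_END_SKETCH   0xff000000UL

/*
 * Does a copy of len bytes, starting offset_in_page bytes into a page,
 * touch the following page as well?
 */
int copy_crosses_page_sketch(unsigned long offset_in_page, unsigned long len)
{
	return offset_in_page + len > PAGE_SIZE_SKETCH;
}

/*
 * Only addresses inside the vmalloc range are candidates for the vmalloc
 * fault path; everything else is rejected up front with -1.
 */
int vmalloc_fault_sketch(unsigned long address)
{
	if (!(address >= VMALLOC_START_SKETCH && address < VMALLOC_END_SKETCH))
		return -1;

	/* The page table synchronisation would happen here. */
	return 0;
}
```

With these example bounds, the faulting address 0xc03b4000 from the report is rejected immediately instead of being walked as if it were a vmalloc address, which is what breaks the recursion.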
  16. 26 May, 2008 1 commit
    • stackprotector: use canary at end of stack to indicate overruns at oops time · 7c9f8861
      Eric Sandeen authored
      (Updated with a common max-stack-used checker that knows about
      the canary, as suggested by Joe Perches)
      
      Use a canary at the end of the stack to clearly indicate
      at oops time whether the stack has ever overflowed.
      
      This is a very simple implementation with a couple of
      drawbacks:
      
      1) a thread may legitimately use exactly up to the last
         word on the stack
      
       -- but the chances of doing this and then oopsing later seem slim
      
      2) it's possible that the stack usage isn't dense enough
         that the canary location could get skipped over
      
       -- but the worst that happens is that we don't flag the overrun
       -- though this happens fairly often in my testing :(
      
      With the code in place, an intentionally-bloated stack oops might
      do:
      
      BUG: unable to handle kernel paging request at ffff8103f84cc680
      IP: [<ffffffff810253df>] update_curr+0x9a/0xa8
      PGD 8063 PUD 0
      Thread overran stack or stack corrupted
      Oops: 0000 [1] SMP
      CPU 0
      ...
      
      ... unless the stack overrun is so bad that it corrupts some other
      thread.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
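The canary idea can be sketched with a mock stack. This is a minimal illustration, not the kernel implementation: the structure, the function names and the magic value are all hypothetical. The stack is assumed to grow downwards, so the "end" of the stack is the lowest word.

```c
#include <stddef.h>

#define STACK_WORDS_SKETCH 64
#define CANARY_SKETCH      0xC0FFEE42UL	/* arbitrary magic value */

/* Mock thread stack; it grows downwards, so words[0] is the last word. */
struct thread_stack_sketch {
	unsigned long words[STACK_WORDS_SKETCH];
};

/* Plant the canary in the last word when the stack is set up. */
void stack_init_sketch(struct thread_stack_sketch *s)
{
	s->words[0] = CANARY_SKETCH;
}

/* At oops time: has the last word been overwritten at any point? */
int stack_overrun_seen_sketch(const struct thread_stack_sketch *s)
{
	return s->words[0] != CANARY_SKETCH;
}
```

The sketch also shows the second drawback listed in the commit message: an overrun that writes below `words[0]` without touching it leaves the canary intact and goes unnoticed.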
  17. 24 May, 2008 3 commits
    • x86: mmiotrace full patch, preview 1 · 0fd0e3da
      Pekka Paalanen authored
      kmmio.c handles the list of mmio probes with callbacks, list of traced
      pages, and attaching into the page fault handler and die notifier. It
      arms, traps and disarms the given pages; this is the core of mmiotrace.
      
      mmio-mod.c is a user interface, hooking into ioremap functions and
      registering the mmio probes. It also decodes the required information
      from trapped mmio accesses via the pre and post callbacks in each probe.
      Currently, hooking into ioremap functions works by redefining the symbols
      of the target (binary) kernel module, so that it calls the traced
      versions of the functions.
      
      The most notable changes done since the last discussion are:
      - kmmio.c is a built-in, not part of the module
      - direct call from fault.c to kmmio.c, removing all dynamic hooks
      - prepare for unregistering probes at any time
      - make kmmio re-initializable and accessible to more than one user
      - rewrite kmmio locking to remove all spinlocks from page fault path
      
      Can I abuse call_rcu() like I do in kmmio.c:unregister_kmmio_probe()
      or is there a better way?
      
      The function called via call_rcu() itself calls call_rcu() again.
      Will this work or break? There I need a second grace period for RCU
      after the first grace period for page faults.
      
      Mmiotrace itself (mmio-mod.c) is still a module, I am going to attack
      that next. At some point I will start looking into how to make mmiotrace
      a tracer component of ftrace (thanks for the hint, Ingo). Ftrace should
      make the user space part of mmiotracing as simple as
      'cat /debug/trace/mmio > dump.txt'.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86: explicit call to mmiotrace in do_page_fault() · 10c43d2e
      Pekka Paalanen authored
      The custom page fault handler list is replaced with a single function
      pointer. All related functions and variables are renamed for
      mmiotrace.
      Signed-off-by: Pekka Paalanen <pq@iki.fi>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: pq@iki.fi
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86: add a list for custom page fault handlers. · 86069782
      Pekka Paalanen authored
      Provides kernel modules a way to register custom page fault handlers.
      On every page fault this will call a list of registered functions. The
      functions may handle the fault and force do_page_fault() to return
      immediately.
      
      This functionality is similar to the now removed page fault notifiers.
      Custom page fault handlers are used by debugging and reverse engineering
      tools. Mmiotrace is one such tool and a patch to add it into the tree
      will follow.
      
      The custom page fault handlers are called earlier in do_page_fault()
      than the page fault notifiers were.
      Signed-off-by: Pekka Paalanen <pq@iki.fi>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
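The mechanism described here, a list of registered functions that is consulted on every fault and can end handling early, can be sketched as follows. This is a user-space model with hypothetical names and a fixed-size array; it is not the interface the patch adds, and it omits the locking a real implementation would need.

```c
#include <stddef.h>

/* A handler returns nonzero if it handled the fault itself. */
typedef int (*pf_handler_sketch)(unsigned long addr, unsigned long error_code);

#define MAX_PF_HANDLERS_SKETCH 4

static pf_handler_sketch pf_handlers_sketch[MAX_PF_HANDLERS_SKETCH];
static size_t nr_pf_handlers_sketch;

/* Register a handler; returns -1 if the table is full. */
int register_pf_handler_sketch(pf_handler_sketch handler)
{
	if (nr_pf_handlers_sketch == MAX_PF_HANDLERS_SKETCH)
		return -1;
	pf_handlers_sketch[nr_pf_handlers_sketch++] = handler;
	return 0;
}

/*
 * Called at the start of fault handling. Returns 1 if some handler
 * claimed the fault, in which case the caller returns immediately.
 */
int call_pf_handlers_sketch(unsigned long addr, unsigned long error_code)
{
	size_t i;

	for (i = 0; i < nr_pf_handlers_sketch; i++)
		if (pf_handlers_sketch[i](addr, error_code))
			return 1;
	return 0;
}

/* Example handlers: one never claims, one claims a made-up MMIO window. */
int ignore_all_sketch(unsigned long addr, unsigned long error_code)
{
	(void)addr;
	(void)error_code;
	return 0;
}

int claim_mmio_sketch(unsigned long addr, unsigned long error_code)
{
	(void)error_code;
	return addr >= 0xfe000000UL;
}
```

A tracing tool in this model would register something like `claim_mmio_sketch` and record the access before letting it proceed; faults outside its window fall through to normal handling.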
  18. 17 Apr, 2008 2 commits
  19. 28 Mar, 2008 1 commit
    • x86: prefetch fix #2 · 3085354d
      Ingo Molnar authored
      Linus noticed a second bug and an uncleanliness:
      
       - we'd return on any instruction fetch fault
      
       - we'd use both the value of 16 and the PF_INSTR symbol which are
         the same and make no sense
      
      the cleanup nicely unifies this piece of logic.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
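The remark about 16 and PF_INSTR being the same value refers to the bit layout of the x86 page fault error code. A sketch of that layout is below; `PF_INSTR` is named in the commit message, while the other `PF_*` names follow the kernel's naming as far as I recall and should be read as assumptions. The helper function is hypothetical.

```c
/* x86 page fault error code bits. */
#define PF_PROT  (1 << 0)	/* protection violation (page was present) */
#define PF_WRITE (1 << 1)	/* access was a write */
#define PF_USER  (1 << 2)	/* access came from user mode */
#define PF_RSVD  (1 << 3)	/* reserved bit set in a page table entry */
#define PF_INSTR (1 << 4)	/* fault on an instruction fetch */

/* Test the symbolic bit rather than comparing against a bare 16. */
int is_instr_fetch_fault_sketch(unsigned long error_code)
{
	return (error_code & PF_INSTR) != 0;
}
```

Using the symbol in one place instead of mixing it with the literal 16 is the kind of unification the commit message calls a cleanup.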
  20. 27 Mar, 2008 1 commit
  21. 15 Feb, 2008 1 commit
  22. 07 Feb, 2008 2 commits
    • x86: fix deadlock, make pgd_lock irq-safe · 58d5d0d8
      Ingo Molnar authored
      lockdep just caught this one:
      
      =================================
      [ INFO: inconsistent lock state ]
      2.6.24 #38
      ---------------------------------
      inconsistent {in-softirq-W} -> {softirq-on-W} usage.
      swapper/1 [HC0[0]:SC0[0]:HE1:SE1] takes:
       (pgd_lock){-+..}, at: [<ffffffff8022a9ea>] mm_init+0x1da/0x250
      {in-softirq-W} state was registered at:
        [<ffffffffffffffff>] 0xffffffffffffffff
      irq event stamp: 394559
      hardirqs last  enabled at (394559): [<ffffffff80267f0a>] get_page_from_freelist+0x30a/0x4c0
      hardirqs last disabled at (394558): [<ffffffff80267d25>] get_page_from_freelist+0x125/0x4c0
      softirqs last  enabled at (393952): [<ffffffff80232f8e>] __do_softirq+0xce/0xe0
      softirqs last disabled at (393945): [<ffffffff8020c57c>] call_softirq+0x1c/0x30
      
      other info that might help us debug this:
      no locks held by swapper/1.
      
      stack backtrace:
      Pid: 1, comm: swapper Not tainted 2.6.24 #38
      
      Call Trace:
       [<ffffffff8024e1fb>] print_usage_bug+0x18b/0x190
       [<ffffffff8024f55d>] mark_lock+0x53d/0x560
       [<ffffffff8024fffa>] __lock_acquire+0x3ca/0xed0
       [<ffffffff80250ba8>] lock_acquire+0xa8/0xe0
       [<ffffffff8022a9ea>] ? mm_init+0x1da/0x250
       [<ffffffff809bcd10>] _spin_lock+0x30/0x70
       [<ffffffff8022a9ea>] mm_init+0x1da/0x250
       [<ffffffff8022aa99>] mm_alloc+0x39/0x50
       [<ffffffff8028b95a>] bprm_mm_init+0x2a/0x1a0
       [<ffffffff8028d12b>] do_execve+0x7b/0x220
       [<ffffffff80209776>] sys_execve+0x46/0x70
       [<ffffffff8020c214>] kernel_execve+0x64/0xd0
       [<ffffffff8020901e>] ? _stext+0x1e/0x20
       [<ffffffff802090ba>] init_post+0x9a/0xf0
       [<ffffffff809bc5f6>] ? trace_hardirqs_on_thunk+0x35/0x3a
       [<ffffffff8024f75a>] ? trace_hardirqs_on+0xba/0xd0
       [<ffffffff8020c1a8>] ? child_rip+0xa/0x12
       [<ffffffff8020bcbc>] ? restore_args+0x0/0x44
       [<ffffffff8020c19e>] ? child_rip+0x0/0x12
      
      turns out that pgd_lock has been used on 64-bit x86 in an irq-unsafe
      way for almost two years, since commit 8c914cb7.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86: make spurious fault handler aware of large mappings · d8b57bb7
      Thomas Gleixner authored
      In very rare cases, on certain CPUs, we could end up in the spurious
      fault handler and ignore a large pud/pmd mapping. The resulting pte
      pointer points into the mapped physical space and dereferencing it
      will fault recursively.
      
      Make the code aware of large mappings and do the permission check
      on the pmd/pud entry, when a large pud/pmd mapping is detected.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
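The fix can be sketched as a two-level walk that stops at the pmd when it maps a large page. This is a simplified model, not the kernel function: the structure and function names are hypothetical, only one upper level is modelled, and only the present, writable and page-size bits of an entry are considered.

```c
/* x86 page table entry flag bits used in this sketch. */
#define ENTRY_PRESENT_SKETCH 0x001UL
#define ENTRY_RW_SKETCH      0x002UL
#define ENTRY_LARGE_SKETCH   0x080UL	/* page-size bit: entry maps a large page */

/* Hypothetical flattened walk result: one pmd entry and one pte entry. */
struct walk_sketch {
	unsigned long pmd;
	unsigned long pte;	/* meaningless if the pmd maps a large page */
};

/* Does this entry allow the access that faulted? */
static int entry_permits_sketch(unsigned long entry, int write)
{
	if (!(entry & ENTRY_PRESENT_SKETCH))
		return 0;
	if (write && !(entry & ENTRY_RW_SKETCH))
		return 0;
	return 1;
}

/* Returns 1 if the fault looks spurious (the mapping already allows it). */
int spurious_fault_sketch(const struct walk_sketch *w, int write)
{
	if (!(w->pmd & ENTRY_PRESENT_SKETCH))
		return 0;

	/* Large mapping: check permissions here, never descend to a pte. */
	if (w->pmd & ENTRY_LARGE_SKETCH)
		return entry_permits_sketch(w->pmd, write);

	return entry_permits_sketch(w->pte, write);
}
```

Without the large-mapping check, the walk would treat the pmd's target as a page table and read a "pte" out of ordinary mapped memory, which is the recursive fault the commit message describes.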
  23. 04 Feb, 2008 2 commits
  24. 02 Feb, 2008 1 commit
  25. 30 Jan, 2008 8 commits