1. 28 April 2012, 2 commits
    • ftrace/x86: Remove the complex ftrace NMI handling code · 4a6d70c9
      Committed by Steven Rostedt
      As ftrace function tracing would require modifying code that could
      be executed in NMI context, which is not stopped with stop_machine(),
      ftrace had to do a complex algorithm with various stages of setup
      and memory barriers to make it work.
      
      With the new breakpoint method, this is no longer required. The
      changes to the code can be done without any problem in NMI context,
      and without stop_machine() altogether. Remove the complex code, as
      it is no longer needed.
      
      Also, a lot of the notrace annotations could be removed from the
      NMI code, as it is now safe to trace it. The exception is
      do_nmi() itself, which does some special work to handle running on
      the debug stack. The breakpoint method can cause NMIs to double-nest
      on the debug stack if it is not set up properly, and that setup is
      done in do_nmi(), thus that function must not be traced.
      
      (Note: the sh architecture may want to do the same.)
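      
      As a rough illustration (not part of the patch), the notrace marker
      that keeps ftrace out of a function is a compiler attribute along
      these lines; handle_debug_stack() below is a hypothetical example:
      
        /* Sketch: notrace is defined roughly like this in
         * include/linux/compiler.h. */
        #define notrace __attribute__((no_instrument_function))
      
        /* Hypothetical function that must never be traced because it
         * runs before the debug-stack handling is in place. */
        static notrace void handle_debug_stack(void)
        {
                /* ... adjust the debug stack state here ... */
        }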
      
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      4a6d70c9
    • ftrace/x86: Have arch x86_64 use breakpoints instead of stop machine · 08d636b6
      Committed by Steven Rostedt
      This method changes x86 to add a breakpoint to the mcount locations
      instead of calling stop_machine().
      
      Now that iret can be handled by NMIs, we perform the following to
      update code:
      
      1) Add a breakpoint to all locations that will be modified
      
      2) Sync all cores
      
      3) Update all locations to be either a nop or call, leaving the
         breakpoint opcode in place
      
      4) Sync all cores
      
      5) Replace the breakpoint with the first byte of the new code.
      
      6) Sync all cores
      
      [
        Added updates that Masami suggested:
         Use unlikely(modifying_ftrace_code) in int3 trap to keep kprobes efficient.
         Don't use NOTIFY_* in ftrace handler in int3 as it is not a notifier.
      ]
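      
      A condensed sketch of that sequence, assuming 5-byte call/nop sites;
      write_insn_bytes() and run_sync() are hypothetical helpers standing
      in for text_poke()-style writes and an IPI that runs sync_core() on
      every CPU (the real code lives in arch/x86/kernel/ftrace.c):
      
        #define INT3_OPCODE      0xcc
        #define MCOUNT_INSN_SIZE 5   /* size of the patched call/nop site */
      
        extern void write_insn_bytes(unsigned long ip, const void *buf, int len);
        extern void run_sync(void);  /* sync_core() on all cores */
      
        static void patch_site(unsigned long ip, const unsigned char *new)
        {
                const unsigned char brk = INT3_OPCODE;
      
                write_insn_bytes(ip, &brk, 1);           /* 1) arm breakpoint   */
                run_sync();                              /* 2) sync all cores   */
                write_insn_bytes(ip + 1, new + 1,        /* 3) patch the tail,  */
                                 MCOUNT_INSN_SIZE - 1);  /*    int3 still first */
                run_sync();                              /* 4) sync all cores   */
                write_insn_bytes(ip, new, 1);            /* 5) restore 1st byte */
                run_sync();                              /* 6) sync all cores   */
        }
      
      Any CPU that hits the temporary int3 between steps 1 and 5 simply
      skips over the site from the int3 handler, which is where the
      unlikely(modifying_ftrace_code) test mentioned above is made.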
      
      Cc: H. Peter Anvin <hpa@zytor.com>
      Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      08d636b6
  2. 06 April 2012, 2 commits
  3. 03 April 2012, 1 commit
  4. 30 March 2012, 4 commits
    • ACPI: Fix use-after-free in acpi_map_lsapic · ac909ec3
      Committed by Petr Vandrovec
      When a processor is being hot-added to the system, acpi_map_lsapic()
      invokes the ACPI _MAT method to find the APIC ID and flags, verifies
      that the returned structure is indeed ACPI's local APIC structure and
      that the flags contain the MADT_ENABLED bit, saves the APIC ID, frees
      the structure, and then accesses the freed structure when computing
      the arguments for the acpi_register_lapic() call. This sometimes
      leads to acpi_register_lapic() being called with a second argument of
      zero, failing to bring the processor online with the error 'Unable to
      map lapic to logical cpu number'.
      
      As lapic->lapic_flags & ACPI_MADT_ENABLED was already confirmed to be
      non-zero a few lines above, we can simply pass ACPI_MADT_ENABLED
      unconditionally to acpi_register_lapic().
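      
      A sketch of the bug and the fix, simplified from acpi_map_lsapic()
      (surrounding code omitted; lapic points into buffer.pointer):
      
        physid = lapic->id;
        kfree(buffer.pointer);        /* frees the memory lapic points at */
      
        /* BUG: reads freed memory; the flags may read back as 0, so the
         * CPU fails to come online. */
        acpi_register_lapic(physid, lapic->lapic_flags & ACPI_MADT_ENABLED);
      
        /* FIX: the ENABLED bit was already verified before the kfree(),
         * so pass it unconditionally. */
        acpi_register_lapic(physid, ACPI_MADT_ENABLED);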
      Signed-off-by: Petr Vandrovec <petr@vmware.com>
      Signed-off-by: Alok N Kataria <akataria@vmware.com>
      Reviewed-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Len Brown <len.brown@intel.com>
      ac909ec3
    • idle, x86: Allow off-lined CPU to enter deeper C states · 1a022e3f
      Committed by Boris Ostrovsky
      Currently when a CPU is off-lined it enters either MWAIT-based idle or,
      if MWAIT is not desired or supported, HLT-based idle (which places the
      processor in C1 state). This patch allows processors without MWAIT
      support to stay in states deeper than C1.
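      
      The resulting offline path looks roughly like this (a simplified
      sketch of native_play_dead() in arch/x86/kernel/smpboot.c after the
      change; cpuidle_play_dead() is the new deeper-C-state fallback):
      
        void native_play_dead(void)
        {
                play_dead_common();
      
                mwait_play_dead();       /* only returns if MWAIT is unusable */
                if (cpuidle_play_dead()) /* try the deepest cpuidle C state   */
                        hlt_play_dead(); /* last resort: HLT, i.e. C1         */
        }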
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@amd.com>
      Signed-off-by: Len Brown <len.brown@intel.com>
      1a022e3f
    • x86: Remove the ancient and deprecated disable_hlt() and enable_hlt() facility · f6365201
      Committed by Len Brown
      The X86_32-only disable_hlt/enable_hlt mechanism was used by the
      32-bit floppy driver. Its effect was to replace the use of the
      HLT instruction inside default_idle() with cpu_relax() - essentially
      it turned off the use of HLT.
      
      This workaround was commented in the code as:
      
       "disable hlt during certain critical i/o operations"
      
       "This halt magic was a workaround for ancient floppy DMA
        wreckage. It should be safe to remove."
      
      H. Peter Anvin additionally adds:
      
       "To the best of my knowledge, no-hlt only existed because of
        flaky power distributions on 386/486 systems which were sold to
        run DOS.  Since DOS did no power management of any kind,
        including HLT, the power draw was fairly uniform; when exposed
        to the much higher noise levels you got when Linux used HLT,
        some of these systems failed.
      
        They were by far in the minority even back then."
      
      Alan Cox further says:
      
       "Also for the Cyrix 5510 which tended to go castors up if a HLT
        occurred during a DMA cycle and on a few other boxes HLT during
        DMA tended to go astray.
      
        Do we care? I doubt it. The 5510 was pretty obscure, the 5520
        fixed it, the 5530 is probably the oldest still in any kind of
        use."
      
      So, let's finally drop this.
      Signed-off-by: Len Brown <len.brown@intel.com>
      Signed-off-by: Josh Boyer <jwboyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: "H. Peter Anvin" <hpa@zytor.com>
      Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@kernel.org>
      Link: http://lkml.kernel.org/n/tip-3rhk9bzf0x9rljkv488tloib@git.kernel.org
      [ If anyone cares then alternative instruction patching could be
        used to replace HLT with a one-byte NOP instruction. Much simpler. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f6365201
    • x86,kgdb: Fix DEBUG_RODATA limitation using text_poke() · 3751d3e8
      Committed by Jason Wessel
      There has long been a limitation on using software breakpoints with a
      kernel compiled with CONFIG_DEBUG_RODATA, going back to 2.6.26. This
      particular patch applies cleanly and has been tested all the way back
      to 2.6.36.
      
      The kprobes code uses the text_poke() function which accommodates
      writing a breakpoint into a read-only page.  The x86 kgdb code can
      solve the problem similarly by overriding the default breakpoint
      set/remove routines and using text_poke() directly.
      
      The x86 kgdb code will first attempt to use the traditional
      probe_kernel_write(), and then fall back to the text_poke() function.
      The breakpoint install method is tracked so that the correct
      breakpoint removal routine gets called later on.
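      
      A condensed sketch of the install path (not the exact kernel code;
      see kgdb_arch_set_breakpoint() in arch/x86/kernel/kgdb.c, with the
      save of the original instruction omitted here):
      
        static int set_sw_breakpoint(struct kgdb_bkpt *bpt)
        {
                int err;
      
                bpt->type = BP_BREAKPOINT;
                err = probe_kernel_write((char *)bpt->bpt_addr,
                                         arch_kgdb_ops.gdb_bpt_instr,
                                         BREAK_INSTR_SIZE);
                if (!err)
                        return 0;
      
                /* Page is read-only under DEBUG_RODATA: poke through a
                 * writable alias instead, and remember that removal
                 * must go through text_poke() as well. */
                text_poke((void *)bpt->bpt_addr, arch_kgdb_ops.gdb_bpt_instr,
                          BREAK_INSTR_SIZE);
                bpt->type = BP_POKE_BREAKPOINT;
                return 0;
        }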
      
      Cc: x86@kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: stable@vger.kernel.org # >= 2.6.36
      Inspired-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
      3751d3e8
  5. 29 March 2012, 5 commits
  6. 28 March 2012, 2 commits
  7. 27 March 2012, 1 commit
  8. 26 March 2012, 1 commit
  9. 24 March 2012, 5 commits
  10. 23 March 2012, 6 commits
  11. 22 March 2012, 2 commits
    • mm: search from free_area_cache for the bigger size · b716ad95
      Committed by Xiao Guangrong
      If the required size is bigger than cached_hole_size, it is better to
      search from free_area_cache: it is easier to find a free region there,
      especially for a 64-bit process, whose address space is large enough.
      
      Do it just as hugetlb_get_unmapped_area_topdown() does in
      arch/x86/mm/hugetlbpage.c.
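      
      The heuristic amounts to the following, as in the generic
      arch_get_unmapped_area() in mm/mmap.c (simplified sketch):
      
        if (len > mm->cached_hole_size) {
                /* No hole below free_area_cache can fit us: keep
                 * searching upward from the cached position. */
                start_addr = addr = mm->free_area_cache;
        } else {
                /* A big enough hole may exist lower down: restart
                 * the scan from the base of the mmap area. */
                start_addr = addr = TASK_UNMAPPED_BASE;
                mm->cached_hole_size = 0;
        }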
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b716ad95
    • mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode · 1a5a9906
      Committed by Andrea Arcangeli
      In some cases it may happen that pmd_none_or_clear_bad() is called
      with the mmap_sem held in read mode.  In those cases the huge page
      faults can allocate hugepmds under pmd_none_or_clear_bad(), and that
      can trigger a false positive from pmd_bad(), which does not expect to
      see a pmd materializing as trans huge.
      
      It's not khugepaged causing the problem: khugepaged holds the mmap_sem
      in write mode (and all those sites must hold the mmap_sem in read mode
      to prevent pagetables from going away from under them; during code
      review it seems vm86 mode on 32-bit kernels requires that too, unless
      it's restricted to 1 thread per process or UP builds).  The race is
      only with the huge pagefaults that can convert a pmd_none() into a
      pmd_trans_huge().
      
      Effectively all these pmd_none_or_clear_bad() sites running with
      mmap_sem in read mode are somewhat speculative with the page faults, and
      the result is always undefined when they run simultaneously.  This is
      probably why it wasn't common to run into this.  For example, if
      madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
      fault, the hugepage will not be zapped; if the page fault runs first,
      it will be zapped.
      
      Altering pmd_bad() not to error out if it finds hugepmds won't be
      enough to fix this, because zap_pmd_range() would then proceed to call
      zap_pte_range() (which would be incorrect if the pmd became a
      pmd_trans_huge()).
      
      The simplest way to fix this is to read the pmd in the local stack
      (regardless of what we read, no need of actual CPU barriers, only
      compiler barrier needed), and be sure it is not changing under the code
      that computes its value.  Even if the real pmd is changing under the
      value we hold on the stack, we don't care.  If we actually end up in
      zap_pte_range it means the pmd was not none already and it was not huge,
      and it can't become huge from under us (khugepaged locking explained
      above).
      
      All we need is to enforce that, in a code path like the one below,
      there is no way anymore for pmd_trans_huge() to return false and for
      pmd_none_or_clear_bad() to then run into a hugepmd.  The overhead of a
      barrier() is just a compiler tweak and should not be measurable (I
      only added it for THP builds).  I don't exclude that different
      compiler versions may have prevented the race too by caching the value
      of *pmd on the stack (that hasn't been verified, but it wouldn't be
      impossible considering that pmd_none_or_clear_bad, pmd_bad,
      pmd_trans_huge and pmd_none are all inlines and no external function
      is called in between pmd_trans_huge and pmd_none_or_clear_bad).
      
      		if (pmd_trans_huge(*pmd)) {
      			if (next-addr != HPAGE_PMD_SIZE) {
      				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
      				split_huge_page_pmd(vma->vm_mm, pmd);
      			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
      				continue;
      			/* fall through */
      		}
      		if (pmd_none_or_clear_bad(pmd))
      
      Because this race condition could be exercised without special
      privileges this was reported in CVE-2012-1179.
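      
      A sketch of the local-read approach, in the spirit of the
      pmd_none_or_trans_huge_or_clear_bad() helper this fix adds to
      include/asm-generic/pgtable.h (details simplified):
      
        static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
        {
                pmd_t pmdval = *pmd;  /* read once, into the local stack */
      
                barrier();  /* compiler barrier: stop re-reads of *pmd */
      
                if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
                        return 1;  /* caller must not walk the ptes */
                if (unlikely(pmd_bad(pmdval))) {
                        pmd_clear_bad(pmd);
                        return 1;
                }
                return 0;
        }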
      
      The race was identified and fully explained by Ulrich who debugged it.
      I'm quoting his accurate explanation below, for reference.
      
      ====== start quote =======
            mapcount 0 page_mapcount 1
            kernel BUG at mm/huge_memory.c:1384!
      
          At some point prior to the panic, a "bad pmd ..." message similar to the
          following is logged on the console:
      
            mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
      
          The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
          the page's PMD table entry.
      
              143 void pmd_clear_bad(pmd_t *pmd)
              144 {
          ->  145         pmd_ERROR(*pmd);
              146         pmd_clear(pmd);
              147 }
      
          After the PMD table entry has been cleared, there is an inconsistency
          between the actual number of PMD table entries that are mapping the page
          and the page's map count (_mapcount field in struct page). When the page
          is subsequently reclaimed, __split_huge_page() detects this inconsistency.
      
             1381         if (mapcount != page_mapcount(page))
             1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
             1383                        mapcount, page_mapcount(page));
          -> 1384         BUG_ON(mapcount != page_mapcount(page));
      
          The root cause of the problem is a race of two threads in a multithreaded
          process. Thread B incurs a page fault on a virtual address that has never
          been accessed (PMD entry is zero) while Thread A is executing an madvise()
          system call on a virtual address within the same 2 MB (huge page) range.
      
                     virtual address space
                    .---------------------.
                    |                     |
                    |                     |
                  .-|---------------------|
                  | |                     |
                  | |                     |<-- B(fault)
                  | |                     |
            2 MB  | |/////////////////////|-.
            huge <  |/////////////////////|  > A(range)
            page  | |/////////////////////|-'
                  | |                     |
                  | |                     |
                  '-|---------------------|
                    |                     |
                    |                     |
                    '---------------------'
      
          - Thread A is executing an madvise(..., MADV_DONTNEED) system call
            on the virtual address range "A(range)" shown in the picture.
      
          sys_madvise
            // Acquire the semaphore in shared mode.
            down_read(&current->mm->mmap_sem)
            ...
            madvise_vma
              switch (behavior)
              case MADV_DONTNEED:
                   madvise_dontneed
                     zap_page_range
                       unmap_vmas
                         unmap_page_range
                           zap_pud_range
                             zap_pmd_range
                               //
                               // Assume that this huge page has never been accessed.
                               // I.e. content of the PMD entry is zero (not mapped).
                               //
                               if (pmd_trans_huge(*pmd)) {
                                   // We don't get here due to the above assumption.
                               }
                               //
                               // Assume that Thread B incurred a page fault and
                   .---------> // sneaks in here as shown below.
                   |           //
                   |           if (pmd_none_or_clear_bad(pmd))
                   |               {
                   |                 if (unlikely(pmd_bad(*pmd)))
                   |                     pmd_clear_bad
                   |                     {
                   |                       pmd_ERROR
                   |                         // Log "bad pmd ..." message here.
                   |                       pmd_clear
                   |                         // Clear the page's PMD entry.
                   |                         // Thread B incremented the map count
                   |                         // in page_add_new_anon_rmap(), but
                   |                         // now the page is no longer mapped
                   |                         // by a PMD entry (-> inconsistency).
                   |                     }
                   |               }
                   |
                   v
          - Thread B is handling a page fault on virtual address "B(fault)" shown
            in the picture.
      
          ...
          do_page_fault
            __do_page_fault
              // Acquire the semaphore in shared mode.
              down_read_trylock(&mm->mmap_sem)
              ...
              handle_mm_fault
                if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
                    // We get here due to the above assumption (PMD entry is zero).
                    do_huge_pmd_anonymous_page
                      alloc_hugepage_vma
                        // Allocate a new transparent huge page here.
                      ...
                      __do_huge_pmd_anonymous_page
                        ...
                        spin_lock(&mm->page_table_lock)
                        ...
                        page_add_new_anon_rmap
                          // Here we increment the page's map count (starts at -1).
                          atomic_set(&page->_mapcount, 0)
                        set_pmd_at
                          // Here we set the page's PMD entry which will be cleared
                          // when Thread A calls pmd_clear_bad().
                        ...
                        spin_unlock(&mm->page_table_lock)
      
          The mmap_sem does not prevent the race because both threads are acquiring
          it in shared mode (down_read).  Thread B holds the page_table_lock while
          the page's map count and PMD table entry are updated.  However, Thread A
          does not synchronize on that lock.
      
      ====== end quote =======
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Jones <davej@redhat.com>
      Acked-by: Larry Woodman <lwoodman@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>		[2.6.38+]
      Cc: Mark Salter <msalter@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a5a9906
  12. 20 March 2012, 2 commits
  13. 19 March 2012, 1 commit
    • x86: Fix section warnings · 943bc7e1
      Committed by Steffen Persvold
      Fix the following section warnings:
      
      WARNING: vmlinux.o(.text+0x49dbc): Section mismatch in reference
      from the function acpi_map_cpu2node() to the variable
      .cpuinit.data:__apicid_to_node The function acpi_map_cpu2node()
      references the variable __cpuinitdata __apicid_to_node. This is
      often because acpi_map_cpu2node lacks a __cpuinitdata
      annotation or the annotation of __apicid_to_node is wrong.
      
      WARNING: vmlinux.o(.text+0x49dc1): Section mismatch in reference
      from the function acpi_map_cpu2node() to the function
      .cpuinit.text:numa_set_node() The function acpi_map_cpu2node()
      references the function __cpuinit numa_set_node(). This is often
      because acpi_map_cpu2node lacks a __cpuinit  annotation or the
      annotation of numa_set_node is wrong.
      
      WARNING: vmlinux.o(.text+0x526e77): Section mismatch in
      reference from the function prealloc_protection_domains() to the
      function .init.text:alloc_passthrough_domain() The function
      prealloc_protection_domains() references the function __init
      alloc_passthrough_domain(). This is often because
      prealloc_protection_domains lacks a __init annotation or the
      annotation of alloc_passthrough_domain is wrong.
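      
      Illustrative pattern only (not the exact patch): such warnings go
      away once the referencing function carries a section annotation
      compatible with the symbols it touches, e.g.:
      
        /* Hypothetical example: the caller joins the referenced symbols
         * in the CPU-init sections, so modpost no longer complains. */
        static void __cpuinit map_cpu_to_node(int cpu, int nid)
        {
                numa_set_node(cpu, nid);  /* __cpuinit function */
        }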
      Signed-off-by: Steffen Persvold <sp@numascale.com>
      Link: http://lkml.kernel.org/r/1331810188-24785-1-git-send-email-sp@numascale.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      943bc7e1
  14. 17 March 2012, 1 commit
  15. 16 March 2012, 3 commits
  16. 14 March 2012, 2 commits