1. 03 April, 2012 (1 commit)
  2. 24 March, 2012 (3 commits)
    • coredump: remove VM_ALWAYSDUMP flag · 909af768
      Authored by Jason Baron
      The motivation for this patchset was that I was looking at a way for a
      qemu-kvm process to exclude the guest memory from its core dump, which
      can be quite large.  There are already a number of filter flags in
      /proc/<pid>/coredump_filter, however, these allow one to specify 'types'
      of kernel memory, not specific address ranges (which is needed in this
      case).
      
      Since there are no more vma flags available, the first patch eliminates
      the need for the 'VM_ALWAYSDUMP' flag.  The flag is used internally by
      the kernel to mark vdso and vsyscall pages.  However, it is simple
      enough to check if a vma covers a vdso or vsyscall page without the need
      for this flag.
      
      The second patch then replaces the 'VM_ALWAYSDUMP' flag with a new
      'VM_NODUMP' flag, which can be set by userspace using new madvise flags:
      'MADV_DONTDUMP', and unset via 'MADV_DODUMP'.  The core dump filters
      continue to work the same as before unless 'MADV_DONTDUMP' is set on the
      region.
      
      The qemu code which implements this feature is at:
      
        http://people.redhat.com/~jbaron/qemu-dump/qemu-dump.patch
      
      In my testing the qemu core dump shrank from 383MB -> 13MB with this
      patch.
      
      I also believe that the 'MADV_DONTDUMP' flag might be useful for
      security sensitive apps, which might want to select which areas are
      dumped.
      
      This patch:
      
      The VM_ALWAYSDUMP flag is currently used by the coredump code to
      indicate that a vma is part of a vsyscall or vdso section.  However, we
      can determine if a vma is in one of these sections by checking it against
      the gate_vma and checking for a non-NULL return value from
      arch_vma_name(), thus freeing up a valuable vma bit.
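      
      A minimal userspace sketch of how the new flags are meant to be used
      (illustrative only, not the qemu patch; the anonymous mapping below
      stands in for guest memory):
      
      	#include <stdio.h>
      	#include <sys/mman.h>
      
      	int main(void)
      	{
      		size_t len = 64 << 20;	/* pretend this is guest RAM */
      		void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
      				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      
      		if (mem == MAP_FAILED)
      			return 1;
      
      		/* exclude the region from any core dump of this process */
      		if (madvise(mem, len, MADV_DONTDUMP))
      			perror("madvise(MADV_DONTDUMP)");
      
      		/* ... use the memory ... */
      
      		/* opt the region back in, e.g. while debugging */
      		madvise(mem, len, MADV_DODUMP);
      		return 0;
      	}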
      Signed-off-by: Jason Baron <jbaron@redhat.com>
      Acked-by: Roland McGrath <roland@hack.frob.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Avi Kivity <avi@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: use for_each_clear_bit_from() · 0b2f4d4d
      Authored by Akinobu Mita
      Use for_each_clear_bit() to iterate over all the cleared bits in a
      memory region.
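      
      For reference, a small sketch of the iteration pattern (kernel-style
      code, illustrative only):
      
      	DECLARE_BITMAP(map, 64);	/* 64-bit scratch bitmap */
      	unsigned int bit = 0;
      
      	bitmap_zero(map, 64);
      	__set_bit(3, map);
      
      	/* visit every clear bit */
      	for_each_clear_bit(bit, map, 64)
      		pr_debug("bit %u is clear\n", bit);
      
      	/* or resume scanning from a given position (inclusive) */
      	bit = 16;
      	for_each_clear_bit_from(bit, map, 64)
      		pr_debug("bit %u is clear\n", bit);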
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bitops: rename for_each_set_bit_cont() in favor of analogous list.h function · 307b1cd7
      Authored by Akinobu Mita
      This renames for_each_set_bit_cont() to for_each_set_bit_from() because
      it is analogous to list_for_each_entry_from() in list.h rather than
      list_for_each_entry_continue().
      
      This doesn't remove for_each_set_bit_cont() for now.
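      
      The naming distinction being followed, as a sketch (list.h semantics;
      the bitmap and print call here are placeholders):
      
      	/*
      	 * list_for_each_entry_continue(pos, head, member) resumes *after*
      	 * pos, excluding it; list_for_each_entry_from(pos, head, member)
      	 * resumes *at* pos, including it.  for_each_set_bit_from() follows
      	 * the latter: the bit currently held in 'bit' is tested first.
      	 */
      	DECLARE_BITMAP(map, 64);
      	unsigned int bit;
      
      	bitmap_fill(map, 64);
      	bit = 10;
      	for_each_set_bit_from(bit, map, 64)
      		pr_debug("set bit %u\n", bit);	/* starts at bit 10 */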
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 23 March, 2012 (5 commits)
  4. 22 March, 2012 (8 commits)
    • xen/smp: Fix bringup bug in AP code. · 106b4438
      Authored by Konrad Rzeszutek Wilk
      The CPU hotplug code now has a callback to help bring up the CPU.
      Without the call we end up getting:
      
       BUG: soft lockup - CPU#0 stuck for 29s! [migration/0:6]
      Modules linked in:
      CPU ] Pid: 6, comm: migration/0 Not tainted 3.3.0upstream-01180-ged378a52 #1 Dell Inc. PowerEdge T105 /0RR825
      RIP: e030:[<ffffffff810d3b8b>]  [<ffffffff810d3b8b>] stop_machine_cpu_stop+0x7b/0xf0
      RSP: e02b:ffff8800ceaabdb0  EFLAGS: 00000293
      .. snip..
      Call Trace:
       [<ffffffff810d3b10>] ? stop_one_cpu_nowait+0x50/0x50
       [<ffffffff810d3841>] cpu_stopper_thread+0xf1/0x1c0
       [<ffffffff815a9776>] ? __schedule+0x3c6/0x760
       [<ffffffff815aa749>] ? _raw_spin_unlock_irqrestore+0x19/0x30
       [<ffffffff810d3750>] ? res_counter_charge+0x150/0x150
       [<ffffffff8108dc76>] kthread+0x96/0xa0
       [<ffffffff815b27e4>] kernel_thread_helper+0x4/0x10
       [<ffffffff815aacbc>] ? retint_restore_ar
      
      This fixes it.
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • crypto: twofish-x86_64-3way - module init/exit functions should be static · ff0a70fe
      Authored by Jussi Kivilinna
      This caused a conflict with camellia-x86_64 when both were compiled into
      the kernel: the module init/exit functions had the same names and were not static.
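      
      The fix boils down to the usual pattern, sketched here with illustrative
      names (the real functions are the cipher registration/unregistration
      routines of the driver):
      
      	#include <linux/init.h>
      	#include <linux/module.h>
      
      	/*
      	 * 'static' keeps these symbols local to this object file, so a
      	 * same-named init() in another built-in cipher can no longer clash.
      	 */
      	static int __init example_cipher_init(void)
      	{
      		return 0;	/* would call crypto_register_alg() here */
      	}
      
      	static void __exit example_cipher_exit(void)
      	{
      		/* would call crypto_unregister_alg() here */
      	}
      
      	module_init(example_cipher_init);
      	module_exit(example_cipher_exit);
      	MODULE_LICENSE("GPL");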
      Reported-by: Randy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
      Acked-by: Randy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: camellia-x86_64 - module init/exit functions should be static · 676a3804
      Authored by Jussi Kivilinna
      This caused a conflict with twofish-x86_64-3way when both were compiled into
      the kernel: the module init/exit functions had the same names and were not static.
      Reported-by: Randy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
      Acked-by: Randy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • numa_emulation: fix cpumask_of_node() · d71b5a73
      Authored by Andrea Arcangeli
      Without this fix the cpumask_of_node() for numa=fake=2 is:
      
          cpumask 0 ff
          cpumask 1 ff
      
      with the fix it's correct and it's set to:
      
          cpumask 0 55
          cpumask 1 aa
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: remove prev_vma from hugetlb_get_unmapped_area_topdown() · b69add21
      Authored by Xiao Guangrong
      After looking up the vma which covers or follows the cached search
      address, the following condition is always true:
      
      	!prev_vma || (addr >= prev_vma->vm_end)
      
      so we can stop checking the previous VMA altogether.
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: search from free_area_cache for the bigger size · b716ad95
      Authored by Xiao Guangrong
      If the required size is bigger than cached_hole_size, it is better to
      search from free_area_cache: a free region is easier to find there,
      especially for 64-bit processes whose address space is large enough.
      Do it just as hugetlb_get_unmapped_area_topdown() does in arch/x86/mm/hugetlbpage.c.
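      
      Roughly the heuristic being mirrored (a sketch of the pattern used by
      hugetlb_get_unmapped_area_topdown(), not the exact diff):
      
      	/*
      	 * If every hole seen so far is smaller than the request, the cached
      	 * state cannot help: forget it and restart from the top of the mmap
      	 * area.  Otherwise keep walking down from free_area_cache.
      	 */
      	if (len <= mm->cached_hole_size) {
      		mm->cached_hole_size = 0;
      		mm->free_area_cache = mm->mmap_base;
      	}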
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: try to search again if it is really needed · cbde83e2
      Authored by Xiao Guangrong
      Search again only if some holes may have been skipped in the first pass.
      
      [akpm@linux-foundation.org: clean up crazy compound definition]
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode · 1a5a9906
      Authored by Andrea Arcangeli
      In some cases it may happen that pmd_none_or_clear_bad() is called with
      the mmap_sem held in read mode.  In those cases the huge page faults can
      allocate hugepmds under pmd_none_or_clear_bad(), and that can trigger a
      false positive from pmd_bad(), which does not expect to see a pmd
      materializing as trans huge.
      
      It's not khugepaged causing the problem: khugepaged holds the mmap_sem
      in write mode (and all those sites must hold the mmap_sem in read mode
      to prevent pagetables from going away under them; during code review it
      seems vm86 mode on 32bit kernels requires that too, unless it's
      restricted to 1 thread per process or UP builds).  The race is only with
      the huge pagefaults that can convert a pmd_none() into a
      pmd_trans_huge().
      
      Effectively all these pmd_none_or_clear_bad() sites running with
      mmap_sem in read mode are somewhat speculative with the page faults, and
      the result is always undefined when they run simultaneously.  This is
      probably why it wasn't common to run into this.  For example if the
      madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
      fault, the hugepage will not be zapped, if the page fault runs first it
      will be zapped.
      
      Altering pmd_bad() not to error out if it finds hugepmds won't be enough
      to fix this, because zap_pmd_range would then proceed to call
      zap_pte_range (which would be incorrect if the pmd became a
      pmd_trans_huge()).
      
      The simplest way to fix this is to read the pmd in the local stack
      (regardless of what we read, no need of actual CPU barriers, only
      compiler barrier needed), and be sure it is not changing under the code
      that computes its value.  Even if the real pmd is changing under the
      value we hold on the stack, we don't care.  If we actually end up in
      zap_pte_range it means the pmd was not none already and it was not huge,
      and it can't become huge from under us (khugepaged locking explained
      above).
      
      All we need is to enforce that there is no way anymore that in a code
      path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
      can run into a hugepmd.  The overhead of a barrier() is just a compiler
      tweak and should not be measurable (I only added it for THP builds).  I
      don't exclude different compiler versions may have prevented the race
      too by caching the value of *pmd on the stack (that hasn't been
      verified, but it wouldn't be impossible considering
      pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
      and there's no external function called in between pmd_trans_huge and
      pmd_none_or_clear_bad).
      
      		if (pmd_trans_huge(*pmd)) {
      			if (next-addr != HPAGE_PMD_SIZE) {
      				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
      				split_huge_page_pmd(vma->vm_mm, pmd);
      			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
      				continue;
      			/* fall through */
      		}
      		if (pmd_none_or_clear_bad(pmd))
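      
      A sketch of the fix described above: snapshot the pmd into a local,
      force the compiler to use that snapshot, and test the snapshot instead
      of re-reading *pmd (illustrative; the upstream helper may differ in
      detail):
      
      	static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
      	{
      		pmd_t pmdval = *pmd;	/* one read, kept on the stack */
      
      	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
      		barrier();	/* keep the compiler from re-reading *pmd */
      	#endif
      		if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
      			return 1;	/* nothing to do at the pte level */
      		if (unlikely(pmd_bad(pmdval))) {
      			pmd_clear_bad(pmd);
      			return 1;
      		}
      		return 0;
      	}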
      
      Because this race condition could be exercised without special
      privileges this was reported in CVE-2012-1179.
      
      The race was identified and fully explained by Ulrich who debugged it.
      I'm quoting his accurate explanation below, for reference.
      
      ====== start quote =======
            mapcount 0 page_mapcount 1
            kernel BUG at mm/huge_memory.c:1384!
      
          At some point prior to the panic, a "bad pmd ..." message similar to the
          following is logged on the console:
      
            mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
      
          The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
          the page's PMD table entry.
      
              143 void pmd_clear_bad(pmd_t *pmd)
              144 {
          ->  145         pmd_ERROR(*pmd);
              146         pmd_clear(pmd);
              147 }
      
          After the PMD table entry has been cleared, there is an inconsistency
          between the actual number of PMD table entries that are mapping the page
          and the page's map count (_mapcount field in struct page). When the page
          is subsequently reclaimed, __split_huge_page() detects this inconsistency.
      
             1381         if (mapcount != page_mapcount(page))
             1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
             1383                        mapcount, page_mapcount(page));
          -> 1384         BUG_ON(mapcount != page_mapcount(page));
      
          The root cause of the problem is a race of two threads in a multithreaded
          process. Thread B incurs a page fault on a virtual address that has never
          been accessed (PMD entry is zero) while Thread A is executing an madvise()
          system call on a virtual address within the same 2 MB (huge page) range.
      
                     virtual address space
                    .---------------------.
                    |                     |
                    |                     |
                  .-|---------------------|
                  | |                     |
                  | |                     |<-- B(fault)
                  | |                     |
            2 MB  | |/////////////////////|-.
            huge <  |/////////////////////|  > A(range)
            page  | |/////////////////////|-'
                  | |                     |
                  | |                     |
                  '-|---------------------|
                    |                     |
                    |                     |
                    '---------------------'
      
          - Thread A is executing an madvise(..., MADV_DONTNEED) system call
            on the virtual address range "A(range)" shown in the picture.
      
          sys_madvise
            // Acquire the semaphore in shared mode.
            down_read(&current->mm->mmap_sem)
            ...
            madvise_vma
              switch (behavior)
              case MADV_DONTNEED:
                   madvise_dontneed
                     zap_page_range
                       unmap_vmas
                         unmap_page_range
                           zap_pud_range
                             zap_pmd_range
                               //
                               // Assume that this huge page has never been accessed.
                               // I.e. content of the PMD entry is zero (not mapped).
                               //
                               if (pmd_trans_huge(*pmd)) {
                                   // We don't get here due to the above assumption.
                               }
                               //
                               // Assume that Thread B incurred a page fault and
                   .---------> // sneaks in here as shown below.
                   |           //
                   |           if (pmd_none_or_clear_bad(pmd))
                   |               {
                   |                 if (unlikely(pmd_bad(*pmd)))
                   |                     pmd_clear_bad
                   |                     {
                   |                       pmd_ERROR
                   |                         // Log "bad pmd ..." message here.
                   |                       pmd_clear
                   |                         // Clear the page's PMD entry.
                   |                         // Thread B incremented the map count
                   |                         // in page_add_new_anon_rmap(), but
                   |                         // now the page is no longer mapped
                   |                         // by a PMD entry (-> inconsistency).
                   |                     }
                   |               }
                   |
                   v
          - Thread B is handling a page fault on virtual address "B(fault)" shown
            in the picture.
      
          ...
          do_page_fault
            __do_page_fault
              // Acquire the semaphore in shared mode.
              down_read_trylock(&mm->mmap_sem)
              ...
              handle_mm_fault
                if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
                    // We get here due to the above assumption (PMD entry is zero).
                    do_huge_pmd_anonymous_page
                      alloc_hugepage_vma
                        // Allocate a new transparent huge page here.
                      ...
                      __do_huge_pmd_anonymous_page
                        ...
                        spin_lock(&mm->page_table_lock)
                        ...
                        page_add_new_anon_rmap
                          // Here we increment the page's map count (starts at -1).
                          atomic_set(&page->_mapcount, 0)
                        set_pmd_at
                          // Here we set the page's PMD entry which will be cleared
                          // when Thread A calls pmd_clear_bad().
                        ...
                        spin_unlock(&mm->page_table_lock)
      
          The mmap_sem does not prevent the race because both threads are acquiring
          it in shared mode (down_read).  Thread B holds the page_table_lock while
          the page's map count and PMD table entry are updated.  However, Thread A
          does not synchronize on that lock.
      
      ====== end quote =======
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Jones <davej@redhat.com>
      Acked-by: Larry Woodman <lwoodman@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>		[2.6.38+]
      Cc: Mark Salter <msalter@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 21 March, 2012 (3 commits)
  6. 20 March, 2012 (3 commits)
  7. 19 March, 2012 (1 commit)
    • x86: Fix section warnings · 943bc7e1
      Authored by Steffen Persvold
      Fix the following section warnings:
      
      WARNING: vmlinux.o(.text+0x49dbc): Section mismatch in reference
      from the function acpi_map_cpu2node() to the variable
      .cpuinit.data:__apicid_to_node The function acpi_map_cpu2node()
      references the variable __cpuinitdata __apicid_to_node. This is
      often because acpi_map_cpu2node lacks a __cpuinitdata
      annotation or the annotation of __apicid_to_node is wrong.
      
      WARNING: vmlinux.o(.text+0x49dc1): Section mismatch in reference
      from the function acpi_map_cpu2node() to the function
      .cpuinit.text:numa_set_node() The function acpi_map_cpu2node()
      references the function __cpuinit numa_set_node(). This is often
      because acpi_map_cpu2node lacks a __cpuinit  annotation or the
      annotation of numa_set_node is wrong.
      
      WARNING: vmlinux.o(.text+0x526e77): Section mismatch in
      reference from the function prealloc_protection_domains() to the
      function .init.text:alloc_passthrough_domain() The function
      prealloc_protection_domains() references the function __init
      alloc_passthrough_domain(). This is often because
      prealloc_protection_domains lacks a __init  annotation or the annotation of alloc_passthrough_domain is wrong.
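      
      The usual resolution for this class of warning is to give the
      referencing function a matching section annotation. A hypothetical
      sketch for the first warning (not the exact patch; the body is reduced
      to the NUMA mapping calls named above):
      
      	/*
      	 * __apicid_to_node lives in .cpuinit.data and numa_set_node() in
      	 * .cpuinit.text, so the caller must be __cpuinit as well for the
      	 * references not to cross from regular .text into init sections.
      	 */
      	static void __cpuinit acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
      	{
      		int nid = acpi_get_node(handle);
      
      		if (nid != -1) {
      			set_apicid_to_node(physid, nid);	/* writes __apicid_to_node */
      			numa_set_node(cpu, nid);
      		}
      	}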
      Signed-off-by: Steffen Persvold <sp@numascale.com>
      Link: http://lkml.kernel.org/r/1331810188-24785-1-git-send-email-sp@numascale.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 17 March, 2012 (2 commits)
  9. 14 March, 2012 (2 commits)
    • crypto: camellia - add assembler implementation for x86_64 · 0b95ec56
      Authored by Jussi Kivilinna
      This patch adds an x86_64 assembler implementation of the Camellia block
      cipher. Two sets of functions are provided. The first set is regular
      'one block at a time' encrypt/decrypt functions. The second is 'two blocks
      at a time' functions that gain a performance increase on out-of-order CPUs.
      Performance of the 2-way functions should equal the 1-way functions on
      in-order CPUs.
      
      Patch has been tested with tcrypt and automated filesystem tests.
      
      Tcrypt benchmark results:
      
      AMD Phenom II 1055T (fam:16, model:10):
      
      camellia-asm vs camellia_generic:
      128bit key:                                             (lrw:256bit)    (xts:256bit)
      size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
      16B     1.27x   1.22x   1.30x   1.42x   1.30x   1.34x   1.19x   1.05x   1.23x   1.24x
      64B     1.74x   1.79x   1.43x   1.87x   1.81x   1.87x   1.48x   1.38x   1.55x   1.62x
      256B    1.90x   1.87x   1.43x   1.94x   1.94x   1.95x   1.63x   1.62x   1.67x   1.70x
      1024B   1.96x   1.93x   1.43x   1.95x   1.98x   2.01x   1.67x   1.69x   1.74x   1.80x
      8192B   1.96x   1.96x   1.39x   1.93x   2.01x   2.03x   1.72x   1.64x   1.71x   1.76x
      
      256bit key:                                             (lrw:384bit)    (xts:512bit)
      size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
      16B     1.23x   1.23x   1.33x   1.39x   1.34x   1.38x   1.04x   1.18x   1.21x   1.29x
      64B     1.72x   1.69x   1.42x   1.78x   1.81x   1.89x   1.57x   1.52x   1.56x   1.65x
      256B    1.85x   1.88x   1.42x   1.86x   1.93x   1.96x   1.69x   1.65x   1.70x   1.75x
      1024B   1.88x   1.86x   1.45x   1.95x   1.96x   1.95x   1.77x   1.71x   1.77x   1.78x
      8192B   1.91x   1.86x   1.42x   1.91x   2.03x   1.98x   1.73x   1.71x   1.78x   1.76x
      
      camellia-asm vs aes-asm (8kB block):
               128bit  256bit
      ecb-enc  1.15x   1.22x
      ecb-dec  1.16x   1.16x
      cbc-enc  0.85x   0.90x
      cbc-dec  1.20x   1.23x
      ctr-enc  1.28x   1.30x
      ctr-dec  1.27x   1.28x
      lrw-enc  1.12x   1.16x
      lrw-dec  1.08x   1.10x
      xts-enc  1.11x   1.15x
      xts-dec  1.14x   1.15x
      
      Intel Core2 T8100 (fam:6, model:23, step:6):
      
      camellia-asm vs camellia_generic:
      128bit key:                                             (lrw:256bit)    (xts:256bit)
      size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
      16B     1.10x   1.12x   1.14x   1.16x   1.16x   1.15x   1.02x   1.02x   1.08x   1.08x
      64B     1.61x   1.60x   1.17x   1.68x   1.67x   1.66x   1.43x   1.42x   1.44x   1.42x
      256B    1.65x   1.73x   1.17x   1.77x   1.81x   1.80x   1.54x   1.53x   1.58x   1.54x
      1024B   1.76x   1.74x   1.18x   1.80x   1.85x   1.85x   1.60x   1.59x   1.65x   1.60x
      8192B   1.77x   1.75x   1.19x   1.81x   1.85x   1.86x   1.63x   1.61x   1.66x   1.62x
      
      256bit key:                                             (lrw:384bit)    (xts:512bit)
      size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
      16B     1.10x   1.07x   1.13x   1.16x   1.11x   1.16x   1.03x   1.02x   1.08x   1.07x
      64B     1.61x   1.62x   1.15x   1.66x   1.63x   1.68x   1.47x   1.46x   1.47x   1.44x
      256B    1.71x   1.70x   1.16x   1.75x   1.69x   1.79x   1.58x   1.57x   1.59x   1.55x
      1024B   1.78x   1.72x   1.17x   1.75x   1.80x   1.80x   1.63x   1.62x   1.65x   1.62x
      8192B   1.76x   1.73x   1.17x   1.78x   1.80x   1.81x   1.64x   1.62x   1.68x   1.64x
      
      camellia-asm vs aes-asm (8kB block):
               128bit  256bit
      ecb-enc  1.17x   1.21x
      ecb-dec  1.17x   1.20x
      cbc-enc  0.80x   0.82x
      cbc-dec  1.22x   1.24x
      ctr-enc  1.25x   1.26x
      ctr-dec  1.25x   1.26x
      lrw-enc  1.14x   1.18x
      lrw-dec  1.13x   1.17x
      xts-enc  1.14x   1.18x
      xts-dec  1.14x   1.17x
      Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • x86/platform: Move APIC ID validity check into platform APIC code · fa63030e
      Authored by Daniel J Blueman
      Move APIC ID validity check into platform APIC code, so it can
      be overridden when needed. For NumaChip systems, always trust
      MADT, as it's constructed with high APIC IDs.
      
      Behaviour was verified on standard x86 systems and on NumaChip
      systems with this change, and compile-tested with allyesconfig.
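      
      The idea, roughly (a sketch following the description above, not the
      exact diff; names are illustrative): validity becomes a per-driver apic
      callback instead of a hard-coded 8-bit limit in the MADT parsing code:
      
      	/* default x86 drivers: IDs of 255 and above cannot be addressed */
      	static int default_apic_id_valid(int apicid)
      	{
      		return apicid < 255;
      	}
      
      	/* NumaChip: the MADT is built with high APIC IDs on purpose, trust it */
      	static int numachip_apic_id_valid(int apicid)
      	{
      		return 1;
      	}
      
      	/*
      	 * Generic MADT/ACPI code then asks the driver:
      	 *	if (!apic->apic_id_valid(apicid))
      	 *		return;
      	 * instead of comparing against a hard-coded constant.
      	 */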
      Signed-off-by: Daniel J Blueman <daniel@numascale-asia.com>
      Reviewed-by: Steffen Persvold <sp@numascale.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Link: http://lkml.kernel.org/r/1331709454-27966-1-git-send-email-daniel@numascale-asia.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  10. 13 March, 2012 (5 commits)
    • sched/x86: Fix overflow in cyc2ns_offset · 9993bc63
      Authored by Salman Qazi
      When a machine boots up, the TSC generally gets reset.  However,
      when kexec is used to boot into a kernel, the TSC value would be
      carried over from the previous kernel.  The computation of
      cyc2ns_offset in set_cyc2ns_scale is prone to an overflow if the
      machine has been up more than 208 days prior to the kexec.  The
      overflow happens when we multiply *scale, even though there is
      enough room to store the final answer.
      
      We fix this issue by decomposing tsc_now into the quotient and
      remainder of division by CYC2NS_SCALE_FACTOR and then performing
      the multiplication separately on the two components.
      
      Refactor code to share the calculation with the previous
      fix in __cycles_2_ns().
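      
      The decomposition, as a sketch (CYC2NS_SCALE_FACTOR is the shift used by
      the cycles-to-ns conversion; variable names here are illustrative):
      
      	/*
      	 * ns = cyc * scale >> CYC2NS_SCALE_FACTOR overflows in the multiply
      	 * once cyc is large enough (a TSC that has run for 208+ days).
      	 * Split cyc by 2^CYC2NS_SCALE_FACTOR first and multiply the two
      	 * parts separately; there is room to store the final sum:
      	 *
      	 *   cyc = quot * 2^CYC2NS_SCALE_FACTOR + rem
      	 *   ns  = quot * scale + ((rem * scale) >> CYC2NS_SCALE_FACTOR)
      	 */
      	u64 quot = cyc >> CYC2NS_SCALE_FACTOR;
      	u64 rem  = cyc & ((1ULL << CYC2NS_SCALE_FACTOR) - 1);
      	u64 ns   = quot * scale + ((rem * scale) >> CYC2NS_SCALE_FACTOR);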
      Signed-off-by: Salman Qazi <sqazi@google.com>
      Acked-by: John Stultz <john.stultz@linaro.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Turner <pjt@google.com>
      Cc: john stultz <johnstul@us.ibm.com>
      Link: http://lkml.kernel.org/r/20120310004027.19291.88460.stgit@dungbeetle.mtv.corp.google.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86/ioapic: Add register level checks to detect bogus io-apic entries · 73d63d03
      Authored by Suresh Siddha
      With the recent changes to clear_IO_APIC_pin(), which tries to
      clear the remoteIRR bit explicitly, some users started to see
      "Unable to reset IRR for apic .." messages.
      
      A close look shows that these are related to bogus IO-APIC entries
      which return all 1's for their io-apic registers. The above-mentioned
      error messages are benign, but the kernel should have ignored such
      io-apics in the first place.
      
      Check if registers 0, 1 and 2 of the listed io-apic read as all 1's
      and ignore such an io-apic.
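      
      In outline, the registration path gains a check like the sketch below
      (the in-tree helper may differ slightly):
      
      	/* a phantom IO-APIC decodes nothing, so every register reads back ~0 */
      	static bool bad_ioapic_register(int idx)
      	{
      		union IO_APIC_reg_00 reg_00;
      		union IO_APIC_reg_01 reg_01;
      		union IO_APIC_reg_02 reg_02;
      
      		reg_00.raw = io_apic_read(idx, 0);
      		reg_01.raw = io_apic_read(idx, 1);
      		reg_02.raw = io_apic_read(idx, 2);
      
      		if (reg_00.raw == -1 && reg_01.raw == -1 && reg_02.raw == -1) {
      			pr_warn("I/O APIC %d registers return all ones, skipping!\n", idx);
      			return true;
      		}
      		return false;
      	}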
      Reported-by: Álvaro Castillo <midgoon@gmail.com>
      Tested-by: Jon Dufresne <jon@jondufresne.org>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: yinghai@kernel.org
      Cc: kernel-team@fedoraproject.org
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: <stable@kernel.org>
      Link: http://lkml.kernel.org/r/1331577393.31585.94.camel@sbsiddha-desk.sc.intel.com
      [ Performed minor cleanup of affected code. ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf/x86: Prettify pmu config literals · f9b4eeb8
      Authored by Peter Zijlstra
      I got somewhat tired of having to decode hex numbers..
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Robert Richter <robert.richter@amd.com>
      Link: http://lkml.kernel.org/n/tip-0vsy1sgywc4uar3mu1szm0rg@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf/x86: Fix local vs remote memory events for NHM/WSM · 87e24f4b
      Authored by Peter Zijlstra
      Verified using the below proglet.. before:
      
      [root@westmere ~]# perf stat -e node-stores -e node-store-misses ./numa 0
      remote write
      
       Performance counter stats for './numa 0':
      
               2,101,554 node-stores
               2,096,931 node-store-misses
      
             5.021546079 seconds time elapsed
      
      [root@westmere ~]# perf stat -e node-stores -e node-store-misses ./numa 1
      local write
      
       Performance counter stats for './numa 1':
      
                 501,137 node-stores
                     199 node-store-misses
      
             5.124451068 seconds time elapsed
      
      After:
      
      [root@westmere ~]# perf stat -e node-stores -e node-store-misses ./numa 0
      remote write
      
       Performance counter stats for './numa 0':
      
               2,107,516 node-stores
               2,097,187 node-store-misses
      
             5.012755149 seconds time elapsed
      
      [root@westmere ~]# perf stat -e node-stores -e node-store-misses ./numa 1
      local write
      
       Performance counter stats for './numa 1':
      
               2,063,355 node-stores
                     165 node-store-misses
      
             5.082091494 seconds time elapsed
      
      #define _GNU_SOURCE
      
      #include <sched.h>
      #include <stdio.h>
      #include <errno.h>
      #include <sys/mman.h>
      #include <sys/types.h>
      #include <dirent.h>
      #include <signal.h>
      #include <unistd.h>
      #include <numaif.h>
      #include <stdlib.h>
      
      #define SIZE (32*1024*1024)
      
      volatile int done;
      
      void sig_done(int sig)
      {
      	done = 1;
      }
      
      int main(int argc, char **argv)
      {
      	cpu_set_t *mask, *mask2;
      	size_t size;
      	int i, err, t;
      	int nrcpus = 1024;
      	char *mem;
      	unsigned long nodemask = 0x01; /* node 0 */
      	DIR *node;
      	struct dirent *de;
      	int read = 0;
      	int local = 0;
      
      	if (argc < 2) {
      		printf("usage: %s [0-3]\n", argv[0]);
      		printf("  bit0 - local/remote\n");
      		printf("  bit1 - read/write\n");
      		exit(0);
      	}
      
      	switch (atoi(argv[1])) {
      	case 0:
      		printf("remote write\n");
      		break;
      	case 1:
      		printf("local write\n");
      		local = 1;
      		break;
      	case 2:
      		printf("remote read\n");
      		read = 1;
      		break;
      	case 3:
      		printf("local read\n");
      		local = 1;
      		read = 1;
      		break;
      	}
      
      	mask = CPU_ALLOC(nrcpus);
      	size = CPU_ALLOC_SIZE(nrcpus);
      	CPU_ZERO_S(size, mask);
      
      	node = opendir("/sys/devices/system/node/node0/");
      	if (!node)
      		perror("opendir");
      	while ((de = readdir(node))) {
      		int cpu;
      
      		if (sscanf(de->d_name, "cpu%d", &cpu) == 1)
      			CPU_SET_S(cpu, size, mask);
      	}
      	closedir(node);
      
      	mask2 = CPU_ALLOC(nrcpus);
      	CPU_ZERO_S(size, mask2);
      	for (i = 0; i < size; i++)
      		CPU_SET_S(i, size, mask2);
      	CPU_XOR_S(size, mask2, mask2, mask); // invert
      
      	if (!local)
      		mask = mask2;
      
      	err = sched_setaffinity(0, size, mask);
      	if (err)
      		perror("sched_setaffinity");
      
      	mem = mmap(0, SIZE, PROT_READ|PROT_WRITE,
      			MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
      	err = mbind(mem, SIZE, MPOL_BIND, &nodemask, 8*sizeof(nodemask), MPOL_MF_MOVE);
      	if (err)
      		perror("mbind");
      
      	signal(SIGALRM, sig_done);
      	alarm(5);
      
      	if (!read) {
      		while (!done) {
      			for (i = 0; i < SIZE; i++)
      				mem[i] = 0x01;
      		}
      	} else {
      		while (!done) {
      			for (i = 0; i < SIZE; i++)
      				t += *(volatile char *)(mem + i);
      		}
      	}
      
      	return 0;
      }
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: <stable@kernel.org>
      Link: http://lkml.kernel.org/n/tip-tq73sxus35xmqpojf7ootxgs@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Cleanup cpu_active madness · 5fbd036b
      Authored by Peter Zijlstra
      Stepan found:
      
      CPU0		CPUn
      
      _cpu_up()
        __cpu_up()
      
      		boostrap()
      		  notify_cpu_starting()
      		  set_cpu_online()
      		  while (!cpu_active())
      		    cpu_relax()
      
      <PREEMPT-out>
      
      smp_call_function(.wait=1)
        /* we find cpu_online() is true */
        arch_send_call_function_ipi_mask()
      
        /* wait-forever-more */
      
      <PREEMPT-in>
      		  local_irq_enable()
      
        cpu_notify(CPU_ONLINE)
          sched_cpu_active()
            set_cpu_active()
      
      Now the purpose of cpu_active is mostly about bringing down a cpu: we
      mark it !active to prevent the load-balancer from moving tasks to it
      while we tear the cpu down. This is required because we only update the
      sched_domain tree after we have brought the cpu down. And it is needed
      so that some tasks can still run while we bring it down; we just don't
      want new tasks to appear.
      
      On cpu-up, however, the sched_domain tree doesn't yet include the new cpu,
      so it's invisible to the load-balancer, regardless of the active state.
      So instead of setting the active state after we boot the new cpu (and
      consequently having to wait for it before enabling interrupts), set the
      cpu active before we set it online and avoid the whole mess.
      Reported-by: Stepan Moskovchenko <stepanm@codeaurora.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1323965362.18942.71.camel@twins
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  11. 11 March, 2012 (2 commits)
    • xen/enlighten: Expose MWAIT and MWAIT_LEAF if hypervisor OKs it. · 73c154c6
      Authored by Konrad Rzeszutek Wilk
      For the hypervisor to take advantage of the MWAIT support it needs
      to extract from the ACPI _CST the register address. But the
      hypervisor does not have the support to parse DSDT so it relies on
      the initial domain (dom0) to parse the ACPI Power Management information
      and push it up to the hypervisor. The pushing of the data is done
      by the processor_harveset_xen module which parses the information that
      the ACPI parser has graciously exposed in 'struct acpi_processor'.
      
      For the ACPI parser to also expose the Cx states for MWAIT, we need
      to expose the MWAIT capability (leaf 1). Furthermore we also need to
      expose the MWAIT_LEAF capability (leaf 5) for cstate.c to properly
      function.
      
      The hypervisor could expose these flags when it traps the XEN_EMULATE_PREFIX
      operations, but it can't do it since it needs to be backwards compatible.
      Instead we choose to use the native CPUID to figure out if the MWAIT
      capability exists and use the XEN_SET_PDC query hypercall to figure out
      if the hypervisor wants us to expose the MWAIT_LEAF capability or not.
      
      Note: The XEN_SET_PDC query was implemented in c/s 23783:
      "ACPI: add _PDC input override mechanism".
      
      With this in place, instead of
       C3 ACPI IOPORT 415
      we get now
       C3:ACPI FFH INTEL MWAIT 0x20
      
      Note: The cpu_idle path which would call the mwait variants for idling
      never gets set, because we set the default pm_idle to be the hypercall variant.
      Acked-by: Jan Beulich <JBeulich@suse.com>
      [v2: Fix missing header file include and #ifdef]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • xen/setup/pm/acpi: Remove the call to boot_option_idle_override. · cc7335b2
      Authored by Konrad Rzeszutek Wilk
      We needed that call in the past to force the kernel to use
      default_idle (which called safe_halt, which called xen_safe_halt).
      
      But set_pm_idle_to_default() now does that, so there is no need
      to use this boot option override.
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  12. 10 March, 2012 (2 commits)
    • gma500: initial medfield merge · 026abc33
      Authored by Kirill A. Shutemov
      We need to merge this ahead of some of the cleanup because a lot of the needed
      cleanup spans both new and old chips. If we try to clean up and then merge,
      we end up fighting ourselves.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      [With a load of the cleanup stuff folded in, register stuff reworked sanely]
      Signed-off-by: Alan Cox <alan@linux.intel.com>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
    • x86: Derandom delay_tsc for 64 bit · a7f4255f
      Authored by Thomas Gleixner
      Commit f0fbf0ab ("x86: integrate delay functions") converted
      delay_tsc() into a random delay generator for 64 bit.  The reason is
      that it merged the mostly identical versions of delay_32.c and
      delay_64.c, though with one subtle difference in the result:
      
       static void delay_tsc(unsigned long loops)
       {
      -	unsigned bclock, now;
      +	unsigned long bclock, now;
      
      Now the function uses rdtscl() which returns the lower 32bit of the
      TSC. On 32bit that's not problematic as unsigned long is 32bit. On 64
      bit this fails when the lower 32bit are close to wrap around when
      bclock is read, because the following check
      
             if ((now - bclock) >= loops)
             	  	break;
      
      evaluated to true on 64bit for e.g. bclock = 0xffffffff and now = 0
      because the unsigned long (now - bclock) of these values results in
      0xffffffff00000001 which is definitely larger than the loops
      value. That explains Tvrtko's observation:
      
      "Because I am seeing udelay(500) (_occasionally_) being short, and
       that by delaying for some duration between 0us (yep) and 491us."
      
      Make those variables explicitly u32 again, so this works for both 32
      and 64 bit.
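      
      A userspace toy that shows the difference (illustrative only):
      
      	#include <stdio.h>
      
      	int main(void)
      	{
      		unsigned long bclock = 0xffffffffUL, now = 0;	/* 64-bit longs */
      		unsigned int  b32    = 0xffffffffU,  n32 = 0;	/* the fixed type */
      
      		/* 0xffffffff00000001: looks like an enormous elapsed time */
      		printf("unsigned long: %#lx\n", now - bclock);
      		/* 0x1: the 32-bit subtraction handles the TSC low-word wrap */
      		printf("u32:           %#x\n", n32 - b32);
      		return 0;
      	}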
      Reported-by: Tvrtko Ursulin <tvrtko.ursulin@onelan.co.uk>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org # >= 2.6.27
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 09 March, 2012 (1 commit)
  14. 08 March, 2012 (2 commits)