1. 16 10月, 2015 2 次提交
    • V
      x86/ioapic: Disable interrupts when re-routing legacy IRQs · c0ff971e
      Vitaly Kuznetsov 提交于
      A sporadic hang with consequent crash is observed when booting Hyper-V Gen1
      guests:
      
       Call Trace:
        <IRQ>
        [<ffffffff810ab68d>] ? trace_hardirqs_off+0xd/0x10
        [<ffffffff8107b616>] queue_work_on+0x46/0x90
        [<ffffffff81365696>] ? add_interrupt_randomness+0x176/0x1d0
        ...
        <EOI>
        [<ffffffff81471ddb>] ? _raw_spin_unlock_irqrestore+0x3b/0x60
        [<ffffffff810c295e>] __irq_put_desc_unlock+0x1e/0x40
        [<ffffffff810c5c35>] irq_modify_status+0xb5/0xd0
        [<ffffffff8104adbb>] mp_register_handler+0x4b/0x70
        [<ffffffff8104c55a>] mp_irqdomain_alloc+0x1ea/0x2a0
        [<ffffffff810c7f10>] irq_domain_alloc_irqs_recursive+0x40/0xa0
        [<ffffffff810c860c>] __irq_domain_alloc_irqs+0x13c/0x2b0
        [<ffffffff8104b070>] alloc_isa_irq_from_domain.isra.1+0xc0/0xe0
        [<ffffffff8104bfa5>] mp_map_pin_to_irq+0x165/0x2d0
        [<ffffffff8104c157>] pin_2_irq+0x47/0x80
        [<ffffffff81744253>] setup_IO_APIC+0xfe/0x802
        ...
        [<ffffffff814631c0>] ? rest_init+0x140/0x140
      
      The issue is easily reproducible with a simple instrumentation: if
      mdelay(10) is put between mp_setup_entry() and mp_register_handler() calls
      in mp_irqdomain_alloc() Hyper-V guest always fails to boot when re-routing
      IRQ0. The issue seems to be caused by the fact that we don't disable
      interrupts while doing IOPIC programming for legacy IRQs and IRQ0 actually
      happens. 
      
      Protect the setup sequence against concurrent interrupts.
      
      [ tglx: Make the protection unconditional and not only for legacy
        	interrupts ]
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Jiang Liu <jiang.liu@linux.intel.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Link: http://lkml.kernel.org/r/1444930943-19336-1-git-send-email-vkuznets@redhat.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      c0ff971e
    • P
      x86/setup: Extend low identity map to cover whole kernel range · f5f3497c
      Paolo Bonzini 提交于
      On 32-bit systems, the initial_page_table is reused by
      efi_call_phys_prolog as an identity map to call
      SetVirtualAddressMap.  efi_call_phys_prolog takes care of
      converting the current CPU's GDT to a physical address too.
      
      For PAE kernels the identity mapping is achieved by aliasing the
      first PDPE for the kernel memory mapping into the first PDPE
      of initial_page_table.  This makes the EFI stub's trick "just work".
      
      However, for non-PAE kernels there is no guarantee that the identity
      mapping in the initial_page_table extends as far as the GDT; in this
      case, accesses to the GDT will cause a page fault (which quickly becomes
      a triple fault).  Fix this by copying the kernel mappings from
      swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
      identity mapping.
      
      For some reason, this is only reproducible with QEMU's dynamic translation
      mode, and not for example with KVM.  However, even under KVM one can clearly
      see that the page table is bogus:
      
          $ qemu-system-i386 -pflash OVMF.fd -M q35 vmlinuz0 -s -S -daemonize
          $ gdb
          (gdb) target remote localhost:1234
          (gdb) hb *0x02858f6f
          Hardware assisted breakpoint 1 at 0x2858f6f
          (gdb) c
          Continuing.
      
          Breakpoint 1, 0x02858f6f in ?? ()
          (gdb) monitor info registers
          ...
          GDT=     0724e000 000000ff
          IDT=     fffbb000 000007ff
          CR0=0005003b CR2=ff896000 CR3=032b7000 CR4=00000690
          ...
      
      The page directory is sane:
      
          (gdb) x/4wx 0x32b7000
          0x32b7000:	0x03398063	0x03399063	0x0339a063	0x0339b063
          (gdb) x/4wx 0x3398000
          0x3398000:	0x00000163	0x00001163	0x00002163	0x00003163
          (gdb) x/4wx 0x3399000
          0x3399000:	0x00400003	0x00401003	0x00402003	0x00403003
      
      but our particular page directory entry is empty:
      
          (gdb) x/1wx 0x32b7000 + (0x724e000 >> 22) * 4
          0x32b7070:	0x00000000
      
      [ It appears that you can skate past this issue if you don't receive
        any interrupts while the bogus GDT pointer is loaded, or if you avoid
        reloading the segment registers in general.
      
        Andy Lutomirski provides some additional insight:
      
         "AFAICT it's entirely permissible for the GDTR and/or LDT
          descriptor to point to unmapped memory.  Any attempt to use them
          (segment loads, interrupts, IRET, etc) will try to access that memory
          as if the access came from CPL 0 and, if the access fails, will
          generate a valid page fault with CR2 pointing into the GDT or
          LDT."
      
        Up until commit 23a0d4e8 ("efi: Disable interrupts around EFI
        calls, not in the epilog/prolog calls") interrupts were disabled
        around the prolog and epilog calls, and the functional GDT was
        re-installed before interrupts were re-enabled.
      
        Which explains why no one has hit this issue until now. ]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Reported-by: NLaszlo Ersek <lersek@redhat.com>
      Cc: <stable@vger.kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      [ Updated changelog. ]
      f5f3497c
  2. 02 10月, 2015 1 次提交
    • L
      x86/kexec: Fix kexec crash in syscall kexec_file_load() · e3c41e37
      Lee, Chun-Yi 提交于
      The original bug is a page fault crash that sometimes happens
      on big machines when preparing ELF headers:
      
          BUG: unable to handle kernel paging request at ffffc90613fc9000
          IP: [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
      
      The bug is caused by us under-counting the number of memory ranges
      and subsequently not allocating enough ELF header space for them.
      The bug is typically masked on smaller systems, because the ELF header
      allocation is rounded up to the next page.
      
      This patch modifies the code in fill_up_crash_elf_data() by using
      walk_system_ram_res() instead of walk_system_ram_range() to correctly
      count the max number of crash memory ranges. That's because the
      walk_system_ram_range() filters out small memory regions that
      reside in the same page, but walk_system_ram_res() does not.
      
      Here's how I found the bug:
      
      After tracing prepare_elf64_headers() and prepare_elf64_ram_headers_callback(),
      the code uses walk_system_ram_res() to fill-in crash memory regions information
      to the program header, so it counts those small memory regions that
      reside in a page area.
      
      But, when the kernel was using walk_system_ram_range() in
      fill_up_crash_elf_data() to count the number of crash memory regions,
      it filters out small regions.
      
      I printed those small memory regions, for example:
      
        kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0
      
      Based on the code in walk_system_ram_range(), this memory region
      will be filtered out:
      
        pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593
        end_pfn = (0x77592258 + 0xfc0 -1 + 1) >> 12 = 0x77593
        end_pfn - pfn = 0x77593 - 0x77593 = 0  <=== if (end_pfn > pfn) is FALSE
      
      So, the max_nr_ranges that's counted by the kernel doesn't include
      small memory regions - causing us to under-allocate the required space.
      That causes the page fault crash that happens in a later code path
      when preparing ELF headers.
      
      This bug is not easy to reproduce on small machines that have few
      CPUs, because the allocated page aligned ELF buffer has more free
      space to cover those small memory regions' PT_LOAD headers.
      Signed-off-by: NLee, Chun-Yi <jlee@suse.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Jiang Liu <jiang.liu@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: kexec@lists.infradead.org
      Cc: linux-kernel@vger.kernel.org
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1443531537-29436-1-git-send-email-jlee@suse.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e3c41e37
  3. 01 10月, 2015 2 次提交
    • T
      x86/process: Unify 32bit and 64bit implementations of get_wchan() · 7ba78053
      Thomas Gleixner 提交于
      The stack layout and the functionality is identical. Use the 64bit
      version for all of x86.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NBorislav Petkov <bp@alien8.de>
      Reviewed-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: kasan-dev <kasan-dev@googlegroups.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wolfram Gloger <wmglo@dent.med.uni-muenchen.de>
      Link: http://lkml.kernel.org/r/20150930083302.779694618@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      7ba78053
    • T
      x86/process: Add proper bound checks in 64bit get_wchan() · eddd3826
      Thomas Gleixner 提交于
      Dmitry Vyukov reported the following using trinity and the memory
      error detector AddressSanitizer
      (https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel).
      
      [ 124.575597] ERROR: AddressSanitizer: heap-buffer-overflow on
      address ffff88002e280000
      [ 124.576801] ffff88002e280000 is located 131938492886538 bytes to
      the left of 28857600-byte region [ffffffff81282e0a, ffffffff82e0830a)
      [ 124.578633] Accessed by thread T10915:
      [ 124.579295] inlined in describe_heap_address
      ./arch/x86/mm/asan/report.c:164
      [ 124.579295] #0 ffffffff810dd277 in asan_report_error
      ./arch/x86/mm/asan/report.c:278
      [ 124.580137] #1 ffffffff810dc6a0 in asan_check_region
      ./arch/x86/mm/asan/asan.c:37
      [ 124.581050] #2 ffffffff810dd423 in __tsan_read8 ??:0
      [ 124.581893] #3 ffffffff8107c093 in get_wchan
      ./arch/x86/kernel/process_64.c:444
      
      The address checks in the 64bit implementation of get_wchan() are
      wrong in several ways:
      
       - The lower bound of the stack is not the start of the stack
         page. It's the start of the stack page plus sizeof (struct
         thread_info)
      
       - The upper bound must be:
      
             top_of_stack - TOP_OF_KERNEL_STACK_PADDING - 2 * sizeof(unsigned long).
      
         The 2 * sizeof(unsigned long) is required because the stack pointer
         points at the frame pointer. The layout on the stack is: ... IP FP
         ... IP FP. So we need to make sure that both IP and FP are in the
         bounds.
      
      Fix the bound checks and get rid of the mix of numeric constants, u64
      and unsigned long. Making all unsigned long allows us to use the same
      function for 32bit as well.
      
      Use READ_ONCE() when accessing the stack. This does not prevent a
      concurrent wakeup of the task and the stack changing, but at least it
      avoids TOCTOU.
      
      Also check task state at the end of the loop. Again that does not
      prevent concurrent changes, but it avoids walking for nothing.
      
      Add proper comments while at it.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Based-on-patch-from: Wolfram Gloger <wmglo@dent.med.uni-muenchen.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NBorislav Petkov <bp@alien8.de>
      Reviewed-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: kasan-dev <kasan-dev@googlegroups.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Wolfram Gloger <wmglo@dent.med.uni-muenchen.de>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20150930083302.694788319@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      eddd3826
  4. 30 9月, 2015 1 次提交
  5. 25 9月, 2015 1 次提交
  6. 23 9月, 2015 2 次提交
  7. 18 9月, 2015 3 次提交
  8. 17 9月, 2015 1 次提交
  9. 16 9月, 2015 3 次提交
  10. 15 9月, 2015 2 次提交
    • S
      x86/apic: Serialize LVTT and TSC_DEADLINE writes · 5d7c631d
      Shaohua Li 提交于
      The APIC LVTT register is MMIO mapped but the TSC_DEADLINE register is an
      MSR. The write to the TSC_DEADLINE MSR is not serializing, so it's not
      guaranteed that the write to LVTT has reached the APIC before the
      TSC_DEADLINE MSR is written. In such a case the write to the MSR is
      ignored and as a consequence the local timer interrupt never fires.
      
      The SDM decribes this issue for xAPIC and x2APIC modes. The
      serialization methods recommended by the SDM differ.
      
      xAPIC:
       "1. Memory-mapped write to LVT Timer Register, setting bits 18:17 to 10b.
        2. WRMSR to the IA32_TSC_DEADLINE MSR a value much larger than current time-stamp counter.
        3. If RDMSR of the IA32_TSC_DEADLINE MSR returns zero, go to step 2.
        4. WRMSR to the IA32_TSC_DEADLINE MSR the desired deadline."
      
      x2APIC:
       "To allow for efficient access to the APIC registers in x2APIC mode,
        the serializing semantics of WRMSR are relaxed when writing to the
        APIC registers. Thus, system software should not use 'WRMSR to APIC
        registers in x2APIC mode' as a serializing instruction. Read and write
        accesses to the APIC registers will occur in program order. A WRMSR to
        an APIC register may complete before all preceding stores are globally
        visible; software can prevent this by inserting a serializing
        instruction, an SFENCE, or an MFENCE before the WRMSR."
      
      The xAPIC method is to just wait for the memory mapped write to hit
      the LVTT by checking whether the MSR write has reached the hardware.
      There is no reason why a proper MFENCE after the memory mapped write would
      not do the same. Andi Kleen confirmed that MFENCE is sufficient for the
      xAPIC case as well.
      
      Issue MFENCE before writing to the TSC_DEADLINE MSR. This can be done
      unconditionally as all CPUs which have TSC_DEADLINE also have MFENCE
      support.
      
      [ tglx: Massaged the changelog ]
      Signed-off-by: NShaohua Li <shli@fb.com>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: <Kernel-team@fb.com>
      Cc: <lenb@kernel.org>
      Cc: <fenghua.yu@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: stable@vger.kernel.org #v3.7+
      Link: http://lkml.kernel.org/r/20150909041352.GA2059853@devbig257.prn2.facebook.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      5d7c631d
    • T
      x86/ioapic: Force affinity setting in setup_ioapic_dest() · 4857c91f
      Thomas Gleixner 提交于
      The recent ioapic cleanups changed the affinity setting in
      setup_ioapic_dest() from a direct write to the hardware to the delayed
      affinity setup via irq_set_affinity().
      
      That results in a warning from chained_irq_exit():
      WARNING: CPU: 0 PID: 5 at kernel/irq/migration.c:32 irq_move_masked_irq
      [<ffffffff810a0a88>] irq_move_masked_irq+0xb8/0xc0
      [<ffffffff8103c161>] ioapic_ack_level+0x111/0x130
      [<ffffffff812bbfe8>] intel_gpio_irq_handler+0x148/0x1c0
      
      The reason is that irq_set_affinity() does not write directly to the
      hardware. It marks the affinity setting as pending and executes it
      from the next interrupt. The chained handler infrastructure does not
      take the irq descriptor lock for performance reasons because such a
      chained interrupt is not visible to any interfaces. So the delayed
      affinity setting triggers the warning in irq_move_masked_irq().
      
      Restore the old behaviour by calling the set_affinity function of the
      ioapic chip in setup_ioapic_dest(). This is safe as none of the
      interrupts can be on the fly at this point.
      
      Fixes: aa5cb97f 'x86/irq: Remove x86_io_apic_ops.set_affinity and related interfaces'
      Reported-and-tested-by: NMika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Jiang Liu <jiang.liu@linux.intel.com>
      Cc: jarkko.nikula@linux.intel.com
      4857c91f
  11. 14 9月, 2015 1 次提交
    • J
      x86/ldt: Fix small LDT allocation for Xen · f454b478
      Jan Beulich 提交于
      While the following commit:
      
        37868fe1 ("x86/ldt: Make modify_ldt synchronous")
      
      added a nice comment explaining that Xen needs page-aligned
      whole page chunks for guest descriptor tables, it then
      nevertheless used kzalloc() on the small size path.
      
      As I'm unaware of guarantees for kmalloc(PAGE_SIZE, ) to return
      page-aligned memory blocks, I believe this needs to be switched
      back to __get_free_page() (or better get_zeroed_page()).
      Signed-off-by: NJan Beulich <jbeulich@suse.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/55E735D6020000780009F1E6@prv-mh.provo.novell.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f454b478
  12. 13 9月, 2015 2 次提交
  13. 11 9月, 2015 4 次提交
    • A
      perf/x86/intel/bts: Set event->hw.itrace_started in pmu::start to match the new logic · d2498729
      Alexander Shishkin 提交于
      Since event->hw.itrace_started is now set in pmu::start() to signal the beginning of
      the trace, do so also in the intel_bts driver.
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: hpa@zytor.com
      Link: http://lkml.kernel.org/r/1437140050-23363-4-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d2498729
    • C
      dma-mapping: consolidate dma_set_mask · 452e06af
      Christoph Hellwig 提交于
      Almost everyone implements dma_set_mask the same way, although some time
      that's hidden in ->set_dma_mask methods.
      
      This patch consolidates those into a common implementation that either
      calls ->set_dma_mask if present or otherwise uses the default
      implementation.  Some architectures used to only call ->set_dma_mask
      after the initial checks, and those instance have been fixed to do the
      full work.  h8300 implemented dma_set_mask bogusly as a no-ops and has
      been fixed.
      
      Unfortunately some architectures overload unrelated semantics like changing
      the dma_ops into it so we still need to allow for an architecture override
      for now.
      
      [jcmvbkbc@gmail.com: fix xtensa]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Signed-off-by: NMax Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      452e06af
    • C
      dma-mapping: consolidate dma_{alloc,free}_{attrs,coherent} · 6894258e
      Christoph Hellwig 提交于
      Since 2009 we have a nice asm-generic header implementing lots of DMA API
      functions for architectures using struct dma_map_ops, but unfortunately
      it's still missing a lot of APIs that all architectures still have to
      duplicate.
      
      This series consolidates the remaining functions, although we still need
      arch opt outs for two of them as a few architectures have very
      non-standard implementations.
      
      This patch (of 5):
      
      The coherent DMA allocator works the same over all architectures supporting
      dma_map operations.
      
      This patch consolidates them and converges the minor differences:
      
       - the debug_dma helpers are now called from all architectures, including
         those that were previously missing them
       - dma_alloc_from_coherent and dma_release_from_coherent are now always
         called from the generic alloc/free routines instead of the ops
         dma-mapping-common.h always includes dma-coherent.h to get the defintions
         for them, or the stubs if the architecture doesn't support this feature
       - checks for ->alloc / ->free presence are removed.  There is only one
         magic instead of dma_map_ops without them (mic_dma_ops) and that one
         is x86 only anyway.
      
      Besides that only x86 needs special treatment to replace a default devices
      if none is passed and tweak the gfp_flags.  An optional arch hook is provided
      for that.
      
      [linux@roeck-us.net: fix build]
      [jcmvbkbc@gmail.com: fix xtensa]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: NMax Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6894258e
    • D
      kexec: split kexec_load syscall from kexec core code · 2965faa5
      Dave Young 提交于
      There are two kexec load syscalls, kexec_load another and kexec_file_load.
       kexec_file_load has been splited as kernel/kexec_file.c.  In this patch I
      split kexec_load syscall code to kernel/kexec.c.
      
      And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
      use kexec_file_load only, or vice verse.
      
      The original requirement is from Ted Ts'o, he want kexec kernel signature
      being checked with CONFIG_KEXEC_VERIFY_SIG enabled.  But kexec-tools use
      kexec_load syscall can bypass the checking.
      
      Vivek Goyal proposed to create a common kconfig option so user can compile
      in only one syscall for loading kexec kernel.  KEXEC/KEXEC_FILE selects
      KEXEC_CORE so that old config files still work.
      
      Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
      architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
      KEXEC_CORE in arch Kconfig.  Also updated general kernel code with to
      kexec_load syscall.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NDave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2965faa5
  14. 09 9月, 2015 1 次提交
    • M
      x86: use generic early mem copy · 5dd2c4bd
      Mark Salter 提交于
      The early_ioremap library now has a generic copy_from_early_mem()
      function.  Use the generic copy function for x86 relocate_initrd().
      
      [akpm@linux-foundation.org: remove MAX_MAP_CHUNK define, per Yinghai Lu]
      Signed-off-by: NMark Salter <msalter@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5dd2c4bd
  15. 05 9月, 2015 4 次提交
  16. 04 9月, 2015 1 次提交
    • T
      x86/alternatives: Make optimize_nops() interrupt safe and synced · 66c117d7
      Thomas Gleixner 提交于
      Richard reported the following crash:
      
      [    0.036000] BUG: unable to handle kernel paging request at 55501e06
      [    0.036000] IP: [<c0aae48b>] common_interrupt+0xb/0x38
      [    0.036000] Call Trace:
      [    0.036000]  [<c0409c80>] ? add_nops+0x90/0xa0
      [    0.036000]  [<c040a054>] apply_alternatives+0x274/0x630
      
      Chuck decoded:
      
       "  0:   8d 90 90 83 04 24       lea    0x24048390(%eax),%edx
          6:   80 fc 0f                cmp    $0xf,%ah
          9:   a8 0f                   test   $0xf,%al
       >> b:   a0 06 1e 50 55          mov    0x55501e06,%al
         10:   57                      push   %edi
         11:   56                      push   %esi
      
       Interrupt 0x30 occurred while the alternatives code was replacing the
       initial 0x90,0x90,0x90 NOPs (from the ASM_CLAC macro) with the
       optimized version, 0x8d,0x76,0x00. Only the first byte has been
       replaced so far, and it makes a mess out of the insn decoding."
      
      optimize_nops() is buggy in two aspects:
      
      - It's not disabling interrupts across the modification
      - It's lacking a sync_core() call
      
      Add both.
      
      Fixes: 4fd4b6e5 'x86/alternatives: Use optimized NOPs for padding'
      Reported-and-tested-by: N"Richard W.M. Jones" <rjones@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Richard W.M. Jones <rjones@redhat.com>
      Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1509031232340.15006@nanosSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      66c117d7
  17. 28 8月, 2015 2 次提交
    • T
      x86/irq: Do not dereference irq descriptor before checking it · a47d4576
      Thomas Gleixner 提交于
      Having the IS_NULL_OR_ERR() check after dereferencing the pointer is
      not really working well.
      
      Move the dereference after the check.
      
      Fixes: a782a7e4 'x86/irq: Store irq descriptor in vector array'
      Reported-and-tested-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      a47d4576
    • L
      x86/mm/mtrr: Remove kernel internal MTRR interfaces: unexport mtrr_add() and mtrr_del() · 2baa891e
      Luis R. Rodriguez 提交于
      The effort to replace mtrr_add() with architecture agnostic
      arch_phys_wc_add() is complete, this will ensure write-combining
      implementations (PAT on x86) is taken advantage instead of using
      MTRR. With the effort done now, hide direct MTRR access for
      drivers.
      
      The legacy user-space /proc/mtrr ABI is not affected.
      
      Update x86 documentation on MTRR to reflect the completion of
      the phasing out of direct access to MTRR, also add a note on
      platform firmware code use of MTRRs based on the obituary
      discussion of MTRRs on Linux [0].
      
        [0] http://lkml.kernel.org/r/1438991330.3109.196.camel@hp.comSigned-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Cc: <syrjala@sci.fi>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Walls <awalls@md.metrocast.net>
      Cc: Antonino Daplas <adaplas@gmail.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dave Airlie <airlied@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Suresh Siddha <sbsiddha@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Ville Syrjälä <syrjala@sci.fi>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: airlied@linux.ie
      Cc: benh@kernel.crashing.org
      Cc: bhelgaas@google.com
      Cc: dan.j.williams@intel.com
      Cc: konrad.wilk@oracle.com
      Cc: linux-fbdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-media@vger.kernel.org
      Cc: mst@redhat.com
      Cc: netdev@vger.kernel.org
      Cc: vinod.koul@intel.com
      Cc: xen-devel@lists.xensource.com
      Link: http://lkml.kernel.org/r/1440443613-13696-12-git-send-email-mcgrof@do-not-panic.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2baa891e
  18. 27 8月, 2015 1 次提交
    • J
      ACPI, PCI: Penalize legacy IRQ used by ACPI SCI · 5d0ddfeb
      Jiang Liu 提交于
      Nick Meier reported a regression with HyperV that "
        After rebooting the VM, the following messages are logged in syslog
        when trying to load the tulip driver:
          tulip: Linux Tulip drivers version 1.1.15 (Feb 27, 2007)
          tulip: 0000:00:0a.0: PCI INT A: failed to register GSI
          tulip: Cannot enable tulip board #0, aborting
          tulip: probe of 0000:00:0a.0 failed with error -16
        Errors occur in 3.19.0 kernel
        Works in 3.17 kernel.
      "
      
      According to the ACPI dump file posted by Nick at
      https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1440072
      
      The ACPI MADT table includes an interrupt source overridden entry for
      ACPI SCI:
      [236h 0566  1]                Subtable Type : 02 <Interrupt Source Override>
      [237h 0567  1]                       Length : 0A
      [238h 0568  1]                          Bus : 00
      [239h 0569  1]                       Source : 09
      [23Ah 0570  4]                    Interrupt : 00000009
      [23Eh 0574  2]        Flags (decoded below) : 000D
                                         Polarity : 1
                                     Trigger Mode : 3
      
      And in DSDT table, we have _PRT method to define PCI interrupts, which
      eventually goes to:
              Name (PRSA, ResourceTemplate ()
              {
                  IRQ (Level, ActiveLow, Shared, )
                      {3,4,5,7,9,10,11,12,14,15}
              })
              Name (PRSB, ResourceTemplate ()
              {
                  IRQ (Level, ActiveLow, Shared, )
                      {3,4,5,7,9,10,11,12,14,15}
              })
              Name (PRSC, ResourceTemplate ()
              {
                  IRQ (Level, ActiveLow, Shared, )
                      {3,4,5,7,9,10,11,12,14,15}
              })
              Name (PRSD, ResourceTemplate ()
              {
                  IRQ (Level, ActiveLow, Shared, )
                      {3,4,5,7,9,10,11,12,14,15}
              })
      
      According to the MADT and DSDT tables, IRQ 9 may be used for:
       1) ACPI SCI in level, high mode
       2) PCI legacy IRQ in level, low mode
      So there's a conflict in polarity setting for IRQ 9.
      
      Prior to commit cd68f6bd ("x86, irq, acpi: Get rid of special
      handling of GSI for ACPI SCI"), ACPI SCI is handled specially and
      there's no check for conflicts between ACPI SCI and PCI legagy IRQ.
      And it seems that the HyperV hypervisor doesn't make use of the
      polarity configuration in IOAPIC entry, so it just works.
      
      Commit cd68f6bd gets rid of the specially handling of ACPI SCI,
      and then the pin attribute checking code discloses the conflicts
      between ACPI SCI and PCI legacy IRQ on HyperV virtual machine,
      and rejects the request to assign IRQ9 to PCI devices.
      
      So penalize legacy IRQ used by ACPI SCI and mark it unusable if ACPI
      SCI attributes conflict with PCI IRQ attributes.
      
      Please refer to following links for more information:
      https://bugzilla.kernel.org/show_bug.cgi?id=101301
      https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1440072
      
      Fixes: cd68f6bd ("x86, irq, acpi: Get rid of special handling of GSI for ACPI SCI")
      Reported-and-tested-by: NNick Meier <nmeier@microsoft.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: 3.19+ <stable@vger.kernel.org> # 3.19+
      Signed-off-by: NJiang Liu <jiang.liu@linux.intel.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      5d0ddfeb
  19. 25 8月, 2015 1 次提交
  20. 23 8月, 2015 1 次提交
  21. 22 8月, 2015 4 次提交
    • T
      x86/apic: Fix fallout from x2apic cleanup · a57e456a
      Thomas Gleixner 提交于
      In the recent x2apic cleanup I got two things really wrong:
      1) The safety check in __disable_x2apic which allows the function to
         be called unconditionally is backwards. The check is there to
         prevent access to the apic MSR in case that the machine has no
         apic. Though right now it returns if the machine has an apic and
         therefor the disabling of x2apic is never invoked.
      
      2) x2apic_disable() sets x2apic_mode to 0 after registering the local
         apic. That's wrong, because register_lapic_address() checks x2apic
         mode and therefor takes the wrong code path.
      
      This results in boot failures on machines with x2apic preenabled by
      BIOS and can also lead to an fatal MSR access on machines without
      apic.
      
      The solutions are simple:
      1) Correct the sanity check for apic availability
      2) Clear x2apic_mode _before_ calling register_lapic_address()
      
      Fixes: 659006bf 'x86/x2apic: Split enable and setup function'
      Reported-and-tested-by: NJavier Monteagudo <javiermon@gmail.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Link: https://bugzilla.redhat.com/show_bug.cgi?id=1224764
      Cc: stable@vger.kernel.org # 4.0+
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Jiang Liu <jiang.liu@linux.intel.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      a57e456a
    • H
      x86/asm/delay: Introduce an MWAITX-based delay with a configurable timer · b466bdb6
      Huang Rui 提交于
      MWAITX can enable a timer and a corresponding timer value
      specified in SW P0 clocks. The SW P0 frequency is the same as
      TSC. The timer provides an upper bound on how long the
      instruction waits before exiting.
      
      This way, a delay function in the kernel can leverage that
      MWAITX timer of MWAITX.
      
      When a CPU core executes MWAITX, it will be quiesced in a
      waiting phase, diminishing its power consumption. This way, we
      can save power in comparison to our default TSC-based delays.
      
      A simple test shows that:
      
      	$ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc
      	$ sleep 10000s
      	$ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc
      
      Results:
      
      	* TSC-based default delay:      485115 uWatts average power
      	* MWAITX-based delay:           252738 uWatts average power
      
      Thus, that's about 240 milliWatts less power consumption. The
      test method relies on the support of AMD CPU accumulated power
      algorithm in fam15h_power for which patches are forthcoming.
      Suggested-by: NAndy Lutomirski <luto@amacapital.net>
      Suggested-by: NBorislav Petkov <bp@suse.de>
      Suggested-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NHuang Rui <ray.huang@amd.com>
      [ Fix delay truncation. ]
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Andreas Herrmann <herrmann.der.user@gmail.com>
      Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Jacob Shin <jacob.w.shin@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Li <tony.li@amd.com>
      Link: http://lkml.kernel.org/r/1438744732-1459-3-git-send-email-ray.huang@amd.com
      Link: http://lkml.kernel.org/r/1439201994-28067-4-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b466bdb6
    • A
      x86/traps: Weaken context tracking entry assertions · f0a97af8
      Andy Lutomirski 提交于
      We were asserting that we were all the way in CONTEXT_KERNEL
      when exception handlers were called.  While having this be true
      is, I think, a nice goal (or maybe a variant in which we assert
      that we're in CONTEXT_KERNEL or some new IRQ context), we're not
      quite there.
      
      In particular, if an IRQ interrupts the SYSCALL prologue and the
      IRQ handler in turn causes an exception, the exception entry
      will be called in RCU IRQ mode but with CONTEXT_USER.
      
      This is okay (nothing goes wrong), but until we fix up the
      SYSCALL prologue, we need to avoid warning.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/c81faf3916346c0e04346c441392974f49cd7184.1440133286.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f0a97af8
    • I
      x86/fpu/math-emu: Fix crash in fork() · 827409b2
      Ingo Molnar 提交于
      During later stages of math-emu bootup the following crash triggers:
      
      	 math_emulate: 0060:c100d0a8
      	 Kernel panic - not syncing: Math emulation needed in kernel
      	 CPU: 0 PID: 1511 Comm: login Not tainted 4.2.0-rc7+ #1012
      	 [...]
      	 Call Trace:
      	  [<c181d50d>] dump_stack+0x41/0x52
      	  [<c181c918>] panic+0x77/0x189
      	  [<c1003530>] ? math_error+0x140/0x140
      	  [<c164c2d7>] math_emulate+0xba7/0xbd0
      	  [<c100d0a8>] ? fpu__copy+0x138/0x1c0
      	  [<c1109c3c>] ? __alloc_pages_nodemask+0x12c/0x870
      	  [<c136ac20>] ? proc_clear_tty+0x40/0x70
      	  [<c136ac6e>] ? session_clear_tty+0x1e/0x30
      	  [<c1003530>] ? math_error+0x140/0x140
      	  [<c1003575>] do_device_not_available+0x45/0x70
      	  [<c100d0a8>] ? fpu__copy+0x138/0x1c0
      	  [<c18258e6>] error_code+0x5a/0x60
      	  [<c1003530>] ? math_error+0x140/0x140
      	  [<c100d0a8>] ? fpu__copy+0x138/0x1c0
      	  [<c100c205>] arch_dup_task_struct+0x25/0x30
      	  [<c1048cea>] copy_process.part.51+0xea/0x1480
      	  [<c115a8e5>] ? dput+0x175/0x200
      	  [<c136af70>] ? no_tty+0x30/0x30
      	  [<c1157242>] ? do_vfs_ioctl+0x322/0x540
      	  [<c104a21a>] _do_fork+0xca/0x340
      	  [<c1057b06>] ? SyS_rt_sigaction+0x66/0x90
      	  [<c104a557>] SyS_clone+0x27/0x30
      	  [<c1824a80>] sysenter_do_call+0x12/0x12
      
      The reason is the incorrect assumption in fpu_copy(), that FNSAVE
      can be executed from math-emu kernels as well.
      
      Don't try to copy the registers, the soft state will be copied
      by fork anyway, so the child task inherits the parent task's
      soft math state.
      
      With this fix applied math-emu kernels boot up fine on modern
      hardware and the 'no387 nofxsr' boot options.
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      827409b2