1. 16 Feb, 2016 4 commits
    • x86/cpu, x86/mm/pkeys: Define new CR4 bit · f28b49d2
      Dave Hansen committed
      There is a new bit in CR4 for enabling protection keys.  We
      will actually enable it later in the series.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210202.3CFC3DB2@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/cpufeature, x86/mm/pkeys: Add protection keys related CPUID definitions · dfb4a70f
      Dave Hansen committed
      There are two CPUID bits for protection keys.  One is for whether
      the CPU contains the feature, and the other will appear set once
      the OS enables protection keys.  Specifically:
      
      	Bit 04: OSPKE. If 1, OS has set CR4.PKE to enable
      	Protection keys (and the RDPKRU/WRPKRU instructions)
      
      This is because userspace can not see CR4 contents, but it can
      see CPUID contents.
      
      X86_FEATURE_PKU is referred to as "PKU" in the hardware documentation:
      
      	CPUID.(EAX=07H,ECX=0H):ECX.PKU [bit 3]
      
      X86_FEATURE_OSPKE is "OSPKE":
      
      	CPUID.(EAX=07H,ECX=0H):ECX.OSPKE [bit 4]
      
      These are the first CPU features which need to look at the
      ECX word in CPUID leaf 0x7, so this patch also includes
      fetching that word into the cpuinfo->x86_capability[] array.
      
      Add it to the disabled-features mask when its config option is
      off.  Even though we are not using it here, we also extend the
      REQUIRED_MASK_BIT_SET() macro to keep it mirroring the
      DISABLED_MASK_BIT_SET() version.
      
      This means that in almost all code, you should use:
      
      	cpu_has(c, X86_FEATURE_PKU)
      
      and *not* the CONFIG option.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210201.7714C250@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
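The two CPUID bits described in the commit above can also be inspected from userspace. The following is a minimal sketch, not kernel code: the helper names are hypothetical, and it assumes a GCC or Clang toolchain whose `<cpuid.h>` provides `__get_cpuid_count()`. On non-x86 targets the reader function simply returns 0.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Bit positions in CPUID.(EAX=07H,ECX=0H):ECX, as quoted in the commit message. */
#define CPUID7_ECX_PKU   (1u << 3)  /* CPU implements protection keys */
#define CPUID7_ECX_OSPKE (1u << 4)  /* OS has set CR4.PKE */

/* Hypothetical helpers: decode the two bits from a raw ECX value. */
static inline bool ecx_has_pku(uint32_t ecx)
{
    return (ecx & CPUID7_ECX_PKU) != 0;
}

static inline bool ecx_has_ospke(uint32_t ecx)
{
    return (ecx & CPUID7_ECX_OSPKE) != 0;
}

/* Read ECX of CPUID leaf 7, subleaf 0; returns 0 if the leaf is unavailable. */
static inline uint32_t read_cpuid7_ecx(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return ecx;
#endif
    return 0;
}
```

A caller would pass `read_cpuid7_ecx()` to both helpers. Seeing PKU set but OSPKE clear would indicate hardware support that the running OS has not enabled.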
    • x86/fpu: Add placeholder for 'Processor Trace' XSAVE state · 1f96b1ef
      Dave Hansen committed
      There is an XSAVE state component for Intel Processor Trace (PT).
      But, we do not currently use it.
      
      We add a placeholder in the code for it so it is not a mystery and
      also so we do not need an explicit enum initialization for Protection
      Keys in a moment.
      
      Why don't we use it?
      
      We might end up using this at _some_ point in the future.  But,
      this is a "system" state which requires using the currently
      unsupported XSAVES feature.  Unlike all the other XSAVE states,
      PT state is also not directly tied to a thread.  You might
      context-switch between threads, but not want to change any of the
      PT state.  Or, you might switch between threads, and *do* want to
      change PT state, all depending on what is being traced.
      
      We currently just manually set some MSRs to do this PT context
      switching, and it is unclear whether replacing our direct MSR use
      with XSAVE will be a net win or loss, both in code complexity and
      performance.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: fenghua.yu@intel.com
      Cc: linux-mm@kvack.org
      Cc: yu-cheng.yu@intel.com
      Link: http://lkml.kernel.org/r/20160212210158.5E4BCAE2@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/cpufeature: Speed up cpu_feature_enabled() · f2cc8e07
      Borislav Petkov committed
      When GCC cannot do constant folding for this macro, it falls back to
      cpu_has(). But static_cpu_has() is optimal and it works at all times
      now. So use it and speed up the fallback case.
      
      Before we had this:
      
        mov    0x99d674(%rip),%rdx        # ffffffff81b0d9f4 <boot_cpu_data+0x34>
        shr    $0x2e,%rdx
        and    $0x1,%edx
        jne    ffffffff811704e9 <do_munmap+0x3f9>
      
      After alternatives patching, it turns into:
      
      		  jmp    0xffffffff81170390
      		  nopl   (%rax)
      		  ...
      		  callq  ffffffff81056e00 <mpx_notify_unmap>
      ffffffff81170390: mov    0x170(%r12),%rdi
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1455578358-28347-1-git-send-email-bp@alien8.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
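The "before" disassembly in the commit above is the dynamic form of a feature test: load a capability word, shift, mask, branch. A minimal userspace sketch of that dynamic check follows. The struct, the word count, and the helper names are assumptions for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NCAPINTS 17u  /* hypothetical number of 32-bit capability words */

/* Hypothetical encoding: feature number = word * 32 + bit. */
#define FEATURE(word, bit) ((word) * 32u + (bit))

struct cpu_caps {
    uint32_t capability[NCAPINTS];
};

/* Dynamic test: one load, one shift, one mask, evaluated on every call. */
static inline bool caps_has(const struct cpu_caps *c, unsigned int feature)
{
    assert(feature < NCAPINTS * 32u);
    return (c->capability[feature / 32u] >> (feature % 32u)) & 1u;
}

/* Record a feature as present by setting its bit in the right word. */
static inline void caps_set(struct cpu_caps *c, unsigned int feature)
{
    assert(feature < NCAPINTS * 32u);
    c->capability[feature / 32u] |= 1u << (feature % 32u);
}
```

The statically patched variant avoids this load and test on the hot path, because the branch itself is rewritten once the feature set is known. That is the difference the "after" disassembly shows.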
  2. 14 Feb, 2016 1 commit
    • x86/mm: Fix INVPCID asm constraint · e2c7698c
      Borislav Petkov committed
      So we want to specify the dependency on both @pcid and @addr so that the
      compiler doesn't reorder accesses to them *before* the TLB flush. But
      for that to work, we need to express this properly in the inline asm and
      deref the whole desc array, not the pointer to it. See clwb() for an
      example.
      
      This fixes the build error on 32-bit:
      
        arch/x86/include/asm/tlbflush.h: In function ‘__invpcid’:
        arch/x86/include/asm/tlbflush.h:26:18: error: memory input 0 is not directly addressable
      
      which gcc4.7 caught but 5.x didn't. Which is strange. :-\
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Michael Matz <matz@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 09 Feb, 2016 5 commits
    • x86/fpu: Fix math emulation in eager fpu mode · 4ecd16ec
      Andy Lutomirski committed
      Systems without an FPU are generally old and therefore use lazy FPU
      switching. Unsurprisingly, math emulation in eager FPU mode is a
      bit buggy. Fix it.
      
      There were two bugs involving kernel code trying to use the FPU
      registers in eager mode even if they didn't exist and one BUG_ON()
      that was incorrect.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: yu-cheng yu <yu-cheng.yu@intel.com>
      Link: http://lkml.kernel.org/r/b4b8d112436bd6fab866e1b4011131507e8d7fbe.1453675014.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/dmi: Switch dmi_remap() from ioremap() [uncached] to ioremap_cache() · ce1143aa
      Andy Lutomirski committed
      DMI cacheability is very confused on x86.
      
      dmi_early_remap() uses early_ioremap(), which uses FIXMAP_PAGE_IO,
      which is __PAGE_KERNEL_IO, which is __PAGE_KERNEL, which is cached.
      
      Don't ask me why this makes any sense.
      
      dmi_remap() uses ioremap(), which requests an uncached mapping.
      
      However, on non-EFI systems, the DMI data generally lives between
      0xf0000 and 0x100000, which is in the legacy ISA range, which
      triggers a special case in the PAT code that overrides the cache
      mode requested by ioremap() and forces a WB mapping.
      
      On a UEFI boot, however, the DMI table can live at any physical
      address.  On my laptop, it's around 0x77dd0000.  That's nowhere near
      the legacy ISA range, so the ioremap() implicit uncached type is
      honored and we end up with a UC- mapping.
      
      UC- is a very, very slow way to read from main memory, so dmi_walk()
      is likely to take much longer than necessary.
      
      Given that, even on UEFI, we do early cached DMI reads, it seems
      safe to just ask for cached access.  Switch to ioremap_cache().
      
      I haven't tried to benchmark this, but I'd guess it saves several
      milliseconds of boot time.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jean Delvare <jdelvare@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Link: http://lkml.kernel.org/r/3147c38e51f439f3c8911db34c7d4ab22d854915.1453791969.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: If INVPCID is available, use it to flush global mappings · d8bced79
      Andy Lutomirski committed
      On my Skylake laptop, INVPCID function 2 (flush absolutely
      everything) takes about 376ns, whereas saving flags, twiddling
      CR4.PGE to flush global mappings, and restoring flags takes about
      539ns.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/ed0ef62581c0ea9c99b9bf6df726015e96d44743.1454096309.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Add INVPCID helpers · 060a402a
      Andy Lutomirski committed
      This adds helpers for each of the four currently-specified INVPCID
      modes.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/8a62b23ad686888cee01da134c91409e22064db9.1454096309.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
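For orientation, the four INVPCID modes mentioned above and the 128-bit descriptor the instruction consumes can be sketched as plain data. This is an illustrative userspace sketch under the assumption that the numbering follows the Intel SDM. The identifiers are chosen for readability and are not claimed to match the patch, and nothing here executes INVPCID, which is a privileged instruction.

```c
#include <stdint.h>

/* INVPCID invalidation types, numbered as in the Intel SDM. */
enum invpcid_type {
    INVPCID_TYPE_INDIV_ADDR      = 0, /* one linear address in one PCID */
    INVPCID_TYPE_SINGLE_CTXT     = 1, /* all non-global mappings of one PCID */
    INVPCID_TYPE_ALL_INCL_GLOBAL = 2, /* everything, including global mappings */
    INVPCID_TYPE_ALL_NON_GLOBAL  = 3, /* everything except global mappings */
};

/* 128-bit descriptor: PCID in the low quadword, linear address in the high one. */
struct invpcid_desc {
    uint64_t d[2];
};

/* Hypothetical builder: only bits 11:0 of the first quadword carry the PCID. */
static inline struct invpcid_desc make_invpcid_desc(uint16_t pcid, uint64_t addr)
{
    struct invpcid_desc desc = { { (uint64_t)(pcid & 0xfffu), addr } };

    return desc;
}
```

Type 2 is the "flush absolutely everything" mode whose timing is quoted in the neighbouring commit d8bced79.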
    • x86/asm/bitops: Force inlining of test_and_set_bit and friends · 8dd5032d
      Denys Vlasenko committed
      Sometimes GCC mysteriously doesn't inline very small functions
      we expect to be inlined, see:
      
        https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
      
      Arguably, GCC should do better, but GCC people aren't willing
      to invest time into it and are asking to use __always_inline
      instead.
      
      With this .config:
      
        http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os
      
      here's an example of functions getting deinlined many times:
      
        test_and_set_bit (166 copies, ~1260 calls)
               55                      push   %rbp
               48 89 e5                mov    %rsp,%rbp
               f0 48 0f ab 3e          lock bts %rdi,(%rsi)
               72 04                   jb     <test_and_set_bit+0xf>
               31 c0                   xor    %eax,%eax
               eb 05                   jmp    <test_and_set_bit+0x14>
               b8 01 00 00 00          mov    $0x1,%eax
               5d                      pop    %rbp
               c3                      retq
      
        test_and_clear_bit (124 copies, ~1000 calls)
               55                      push   %rbp
               48 89 e5                mov    %rsp,%rbp
               f0 48 0f b3 3e          lock btr %rdi,(%rsi)
               72 04                   jb     <test_and_clear_bit+0xf>
               31 c0                   xor    %eax,%eax
               eb 05                   jmp    <test_and_clear_bit+0x14>
               b8 01 00 00 00          mov    $0x1,%eax
               5d                      pop    %rbp
               c3                      retq
      
        change_bit (3 copies, 8 calls)
               55                      push   %rbp
               48 89 e5                mov    %rsp,%rbp
               f0 48 0f bb 3e          lock btc %rdi,(%rsi)
               5d                      pop    %rbp
               c3                      retq
      
        clear_bit_unlock (2 copies, 11 calls)
               55                      push   %rbp
               48 89 e5                mov    %rsp,%rbp
               f0 48 0f b3 3e          lock btr %rdi,(%rsi)
               5d                      pop    %rbp
               c3                      retq
      
      This patch works around it via s/inline/__always_inline/.
      
      Code size decreases by ~13.5k after the patch:
      
            text     data      bss       dec    filename
        92110727 20826144 36417536 149354407    vmlinux.before
        92097234 20826176 36417536 149340946    vmlinux.after
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Graf <tgraf@suug.ch>
      Link: http://lkml.kernel.org/r/1454881887-1367-1-git-send-email-dvlasenk@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
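The s/inline/__always_inline/ change described above can be illustrated outside the kernel. The sketch below is an assumption-laden userspace analogue: it uses the GCC/Clang `always_inline` attribute and the `__atomic` builtins rather than the kernel's `lock bts`/`lock btr` inline assembly, and the `force_inline` macro name is made up for this example.

```c
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Force inlining even when the compiler's heuristics (e.g. under -Os) would not. */
#define force_inline inline __attribute__((always_inline))

/* Atomically set bit nr in the bitmap at addr; return the bit's previous value. */
static force_inline bool test_and_set_bit(unsigned long nr,
                                          volatile unsigned long *addr)
{
    unsigned long mask = 1UL << (nr % BITS_PER_LONG);
    unsigned long old = __atomic_fetch_or(&addr[nr / BITS_PER_LONG], mask,
                                          __ATOMIC_SEQ_CST);

    return (old & mask) != 0;
}

/* Atomically clear bit nr in the bitmap at addr; return the bit's previous value. */
static force_inline bool test_and_clear_bit(unsigned long nr,
                                            volatile unsigned long *addr)
{
    unsigned long mask = 1UL << (nr % BITS_PER_LONG);
    unsigned long old = __atomic_fetch_and(&addr[nr / BITS_PER_LONG], ~mask,
                                           __ATOMIC_SEQ_CST);

    return (old & mask) != 0;
}
```

With plain `inline`, the compiler may still emit out-of-line copies like the ones listed in the commit message. The attribute removes that choice, so each call site carries only the few instructions of the body.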
  4. 05 Feb, 2016 1 commit
    • x86: Fix KASAN false positives in thread_saved_pc() · 75edb54a
      Dmitry Vyukov committed
      thread_saved_pc() reads the stack of a potentially running task.
      This can cause false KASAN stack-out-of-bounds reports,
      because the running task concurrently poisons and unpoisons
      its own stack.
      
      The same happens in get_wchan(), and get_wchan() was fixed
      by using READ_ONCE_NOCHECK(). Do the same here.
      
      Example KASAN report triggered by sysrq-t:
      
        BUG: KASAN: out-of-bounds in sched_show_task+0x306/0x3b0 at addr ffff880043c97c18
        Read of size 8 by task syz-executor/23839
        [...]
        page dumped because: kasan: bad access detected
        [...]
        Call Trace:
         [<ffffffff8175ea0e>] __asan_report_load8_noabort+0x3e/0x40
         [<ffffffff813e7a26>] sched_show_task+0x306/0x3b0
         [<ffffffff813e7bf4>] show_state_filter+0x124/0x1a0
         [<ffffffff82d2ca00>] fn_show_state+0x10/0x20
         [<ffffffff82d2cf98>] k_spec+0xa8/0xe0
         [<ffffffff82d3354f>] kbd_event+0xb9f/0x4000
         [<ffffffff843ca8a7>] input_to_handler+0x3a7/0x4b0
         [<ffffffff843d1954>] input_pass_values.part.5+0x554/0x6b0
         [<ffffffff843d29bc>] input_handle_event+0x2ac/0x1070
         [<ffffffff843d3a47>] input_inject_event+0x237/0x280
         [<ffffffff843e8c28>] evdev_write+0x478/0x680
         [<ffffffff817ac653>] __vfs_write+0x113/0x480
         [<ffffffff817ae0e7>] vfs_write+0x167/0x4a0
         [<ffffffff817b13d1>] SyS_write+0x111/0x220
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: glider@google.com
      Cc: kasan-dev@googlegroups.com
      Cc: kcc@google.com
      Cc: linux-kernel@vger.kernel.org
      Cc: ryabinin.a.a@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
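The idea behind READ_ONCE_NOCHECK() in the commit above, a single load that the sanitizer does not instrument, can be approximated in userspace. This is a hedged sketch and not the kernel macro: `read_once_nocheck()` is a hypothetical name, and the code assumes a compiler that accepts the `no_sanitize_address` function attribute (GCC and Clang do).

```c
/*
 * Hypothetical userspace analogue of READ_ONCE_NOCHECK(): one volatile load
 * performed inside a function that AddressSanitizer is told not to instrument.
 * noinline is meant to keep the load from being merged into an instrumented
 * caller.
 */
__attribute__((no_sanitize_address, noinline))
static unsigned long read_once_nocheck(const unsigned long *p)
{
    return *(const volatile unsigned long *)p;
}
```

The volatile cast asks for exactly one load that the compiler will not split or repeat. The attribute addresses the false positive: a reader that peeks at memory another thread is concurrently poisoning and unpoisoning would otherwise trip the checker.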
  5. 30 Jan, 2016 4 commits
    • x86/alternatives: Discard dynamic check after init · 2476f2fa
      Brian Gerst committed
      Move the code to do the dynamic check to the altinstr_aux
      section so that it is discarded after alternatives have run and
      a static branch has been chosen.
      
      This way we're changing the dynamic branch from C code to
      assembly, which makes it *substantially* smaller while avoiding
      a completely unnecessary call to an out of line function.
      Signed-off-by: Brian Gerst <brgerst@gmail.com>
      [ Changed it to do TESTB, as hpa suggested. ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kristen Carlson Accardi <kristen@linux.intel.com>
      Cc: Laura Abbott <labbott@fedoraproject.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1452972124-7380-1-git-send-email-brgerst@gmail.com
      Link: http://lkml.kernel.org/r/20160127084525.GC30712@pd.tnic
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/cpufeature: Get rid of the non-asm goto variant · a362bf9f
      Borislav Petkov committed
      I can simply quote hpa from the mail:
      
        "Get rid of the non-asm goto variant and just fall back to
         dynamic if asm goto is unavailable. It doesn't make any sense,
         really, if it is supposed to be safe, and by now the asm
         goto-capable gcc is in more wide use. (Originally the gcc 3.x
         fallback to pure dynamic didn't exist, either.)"
      
      Booy, am I lazy.
      
      Clean up the whole CC_HAVE_ASM_GOTO ifdeffery too, while at it.
      Suggested-by: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160127084325.GB30712@pd.tnic
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/cpufeature: Replace the old static_cpu_has() with safe variant · bc696ca0
      Borislav Petkov committed
      So the old one didn't work properly before alternatives had run.
      And it was supposed to provide an optimized JMP because the
      assumption was that the offset it is jumping to is within a
      signed byte and thus a two-byte JMP.
      
      So I did an x86_64 allyesconfig build and dumped all possible
      sites where static_cpu_has() was used. The optimization amounted
      to all in all 12(!) places where static_cpu_has() had generated
      a 2-byte JMP. Which has saved us a whopping 36 bytes!
      
      This clearly is not worth the trouble so we can remove it. The
      only place where the optimization might count - in __switch_to()
      - we will handle differently. But that's not the subject of this
      patch.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1453842730-28463-6-git-send-email-bp@alien8.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/cpufeature: Carve out X86_FEATURE_* · cd4d09ec
      Borislav Petkov committed
      Move them to a separate header and have the following
      dependency:
      
        x86/cpufeatures.h <- x86/processor.h <- x86/cpufeature.h
      
      This makes it easier to use the header in asm code and not
      include the whole cpufeature.h and add guards for asm.
      Suggested-by: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1453842730-28463-5-git-send-email-bp@alien8.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  6. 27 Jan, 2016 1 commit
  7. 23 Jan, 2016 1 commit
    • pmem: add wb_cache_pmem() to the PMEM API · 3f4a2670
      Ross Zwisler committed
      __arch_wb_cache_pmem() was already an internal implementation detail of
      the x86 PMEM API, but this functionality needs to be exported as part of
      the general PMEM API to handle the fsync/msync case for DAX mmaps.
      
      One thing worth noting is that we really do want this to be part of the
      PMEM API as opposed to a stand-alone function like clflush_cache_range()
      because of ordering restrictions.  By having wb_cache_pmem() as part of
      the PMEM API we can leave it unordered, call it multiple times to write
      back large amounts of memory, and then order the multiple calls with a
      single wmb_pmem().
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
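The ordering argument in the commit above (many unordered write-backs, then one barrier) can be sketched as a calling pattern. The functions below are counting stand-ins that only borrow the names from the commit message. Their signatures and bodies are assumptions, and no cache lines are actually written back.

```c
#include <stddef.h>

/* Hypothetical stand-ins: they only count calls so the pattern can be checked. */
static unsigned int writebacks, fences;

static void wb_cache_pmem(void *addr, size_t size)
{
    (void)addr;
    (void)size;
    writebacks++;   /* a real implementation would write back cache lines */
}

static void wmb_pmem(void)
{
    fences++;       /* a real implementation would order prior write-backs */
}

struct range {
    void *addr;
    size_t size;
};

/* Write back every range unordered, then order all of them with one barrier. */
static void flush_ranges(const struct range *r, size_t n)
{
    for (size_t i = 0; i < n; i++)
        wb_cache_pmem(r[i].addr, r[i].size);
    wmb_pmem();
}
```

A stand-alone helper that fenced internally would pay for one barrier per range. Keeping the write-back unordered lets the caller amortize a single barrier over any number of ranges, which appears to be the design point the message argues for.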
  8. 21 Jan, 2016 1 commit
    • dma-mapping: always provide the dma_map_ops based implementation · e1c7e324
      Christoph Hellwig committed
      Move the generic implementation to <linux/dma-mapping.h> now that all
      architectures support it and remove the HAVE_DMA_ATTR Kconfig symbol now
      that everyone supports them.
      
      [valentinrothberg@gmail.com: remove leftovers in Kconfig]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Helge Deller <deller@gmx.de>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Signed-off-by: Valentin Rothberg <valentinrothberg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 20 Jan, 2016 3 commits
  10. 19 Jan, 2016 3 commits
    • x86/asm: Add C versions of frame pointer macros · ec518655
      Josh Poimboeuf committed
      Add C versions of the frame pointer macros which can be used to
      create a stack frame in inline assembly.
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Chris J Arges <chris.j.arges@canonical.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Cc: Pedro Alves <palves@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/f6786a282bf232ede3e2866414eae3cf02c7d662.1450442274.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/asm: Clean up frame pointer macros · 997963ed
      Josh Poimboeuf committed
      The asm macros for setting up and restoring the frame pointer
      aren't currently being used.  However, they will be needed soon
      to help asm functions to comply with stacktool.
      
      Rename FRAME/ENDFRAME to FRAME_BEGIN/FRAME_END for more
      symmetry.  Also make the code more readable and improve the
      comments.
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Chris J Arges <chris.j.arges@canonical.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Cc: Pedro Alves <palves@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/3f488a8e3bfc8ac7d4d3d350953e664e7182b044.1450442274.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/cpufeature: Add AMD AVIC bit · a1ff5726
      Borislav Petkov committed
      CPUID Fn8000_000A_EDX[13] denotes support for AMD's Virtual
      Interrupt controller, i.e., APIC virtualization.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Kaplan <david.kaplan@amd.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Link: http://lkml.kernel.org/r/1452938292-12327-1-git-send-email-bp@alien8.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 16 Jan, 2016 10 commits
    • mm, x86: get_user_pages() for dax mappings · 3565fce3
      Dan Williams committed
      A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
      has established a devm_memremap_pages() mapping, i.e. when the pfn_t
      returned from ->direct_access() has PFN_DEV and PFN_MAP set.  Later, when
      encountering _PAGE_DEVMAP during a page table walk, we look up and pin a
      struct dev_pagemap instance to keep the result of pfn_to_page() valid
      until put_page().
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Tested-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3565fce3
    • D
      mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd · 5c7fb56e
      Dan Williams authored
      A dax-huge-page mapping, while it uses some thp helpers, is ultimately not
      a transparent huge page.  The distinction is especially important in the
      get_user_pages() path.  pmd_devmap() is used to distinguish dax-pmds
      from pmd_huge() and pmd_trans_huge() which have slightly different
      semantics.
      
      Explicitly mark the pmd_trans_huge() helpers that dax needs by adding
      pmd_devmap() checks.
      
      [kirill.shutemov@linux.intel.com: fix regression in handling mlocked pages in __split_huge_pmd()]
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c7fb56e
    • D
      mm, dax: convert vmf_insert_pfn_pmd() to pfn_t · f25748e3
      Dan Williams authored
      Similar to the conversion of vm_insert_mixed(), use pfn_t in
      vmf_insert_pfn_pmd() to tag the resulting pte with _PAGE_DEVMAP when the
      pfn is backed by a devm_memremap_pages() mapping.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f25748e3
    • D
      mm, dax, gpu: convert vm_insert_mixed to pfn_t · 01c8f1c4
      Dan Williams authored
      Convert the raw unsigned long 'pfn' argument to pfn_t for the purpose of
      evaluating the PFN_MAP and PFN_DEV flags.  When both are set it triggers
      _PAGE_DEVMAP to be set in the resulting pte.
      
      There are no functional changes to the gpu drivers as a result of this
      conversion.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: David Airlie <airlied@linux.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01c8f1c4
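The rule stated in the vm_insert_mixed conversion above (the devmap pte flag is set only when both PFN_DEV and PFN_MAP are present) can be sketched as a flag test on a wrapped pfn value. This is an illustration only: the type name, helper names, and flag bit positions below are assumptions chosen for the example, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag positions in the upper bits of a 64-bit pfn value. */
#define SKETCH_PFN_DEV   (1ULL << 61)
#define SKETCH_PFN_MAP   (1ULL << 60)
#define SKETCH_PFN_FLAGS (SKETCH_PFN_DEV | SKETCH_PFN_MAP)

/* A pfn wrapped in a struct so that the flags travel with the frame number. */
typedef struct { uint64_t val; } sketch_pfn_t;

/* Build a flagged pfn; the frame number must not overlap the flag bits. */
static sketch_pfn_t sketch_make_pfn(uint64_t pfn, uint64_t flags)
{
    assert((pfn & SKETCH_PFN_FLAGS) == 0);
    assert((flags & ~SKETCH_PFN_FLAGS) == 0);
    return (sketch_pfn_t){ .val = pfn | flags };
}

/* True only when BOTH flags are set, mirroring the rule in the commit text. */
static bool sketch_pfn_wants_devmap(sketch_pfn_t pfn)
{
    return (pfn.val & SKETCH_PFN_FLAGS) == SKETCH_PFN_FLAGS;
}
```

Wrapping the value in a struct instead of passing a raw unsigned long lets the compiler catch callers that forget to state which kind of pfn they are inserting.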
    • D
      x86, mm: introduce _PAGE_DEVMAP · 69660fd7
      Dan Williams authored
      _PAGE_DEVMAP is a hardware-unused pte bit that will later be used in the
      get_user_pages() path to identify pfns backed by the dynamic allocation
      established by devm_memremap_pages().  Upon seeing that bit the gup path
      will look up and pin the allocation while the pages are in use.
      
      Since the _PAGE_DEVMAP bit is > 32 it must be cast to u64 instead of a
      pteval_t to allow pmd_flags() usage in the realmode boot code to build.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69660fd7
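The remark in the _PAGE_DEVMAP commit about casting to u64 can be illustrated with plain integer arithmetic: a flag above bit 31 does not survive in a 32-bit value. The sketch below assumes bit 58 for the flag (the commit only says the bit is above 32), and every name in it is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit position for illustration; the commit only states "> 32". */
#define SKETCH_PAGE_BIT_DEVMAP 58u

/* Build a single-bit flag in a 64-bit type so high bits are representable. */
static uint64_t sketch_flag64(unsigned int bit)
{
    assert(bit < 64);
    return (uint64_t)1 << bit;
}

/* What is left of a flag if it is stored in a 32-bit page-table value type. */
static uint32_t sketch_as_32bit_pteval(uint64_t flag)
{
    return (uint32_t)flag; /* bits 32..63 are dropped by the conversion */
}

/* Test the flag against a full 64-bit page-table entry value. */
static bool sketch_has_devmap(uint64_t pteval)
{
    return (pteval & sketch_flag64(SKETCH_PAGE_BIT_DEVMAP)) != 0;
}
```

With a 64-bit flag the test works as expected, while the 32-bit conversion of the same flag yields zero, which is why the constant's type matters in code built with a narrower page-table value type.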
    • D
      pmem, dax: clean up clear_pmem() · 52db400f
      Dan Williams authored
      To date, we have implemented two I/O usage models for persistent memory,
      PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
      userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
      to be the target of direct-i/o.  It allows userspace to coordinate
      DMA/RDMA from/to persistent memory.
      
      The implementation leverages the ZONE_DEVICE mm-zone that went into
      4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
      and dynamically mapped by a device driver.  The pmem driver, after
      mapping a persistent memory range into the system memmap via
      devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
      page-backed pmem-pfns via flags in the new pfn_t type.
      
      The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
      resulting pte(s) inserted into the process page tables with a new
      _PAGE_DEVMAP flag.  Later, when get_user_pages() is walking ptes it keys
      off _PAGE_DEVMAP to pin the device hosting the page range active.
      Finally, get_page() and put_page() are modified to take references
      against the device driver established page mapping.
      
      This need for "struct page" for persistent memory also requires
      memory capacity to store the memmap array.  Given that the memmap array
      for a large pool of persistent memory may exhaust available DRAM,
      introduce a mechanism to allocate the memmap from persistent memory.
      The new "struct vmem_altmap *" parameter to devm_memremap_pages() enables
      arch_add_memory() to use reserved pmem capacity rather than the page
      allocator.
      
      This patch (of 25):
      
      Both __dax_pmd_fault() and clear_pmem() were taking special steps to
      clear memory a page at a time to take advantage of non-temporal
      clear_page() implementations.  However, x86_64 does not use non-temporal
      instructions for clear_page(), and arch_clear_pmem() was always
      incurring the cost of __arch_wb_cache_pmem().
      
      Clean up the assumption that doing clear_pmem() a page at a time is more
      performant.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52db400f
    • M
      arch/x86/include/asm/pgtable.h: add pmd_[dirty|mkclean] for THP · 590a471c
      Minchan Kim authored
      MADV_FREE needs pmd_dirty() and pmd_mkclean() to detect whether the
      contents of a THP page were overwritten after the MADV_FREE syscall
      was called on it.
      
      This patch adds pmd_dirty() and pmd_mkclean() for THP page MADV_FREE
      support.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jason Evans <je@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      590a471c
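The two helpers named in the pmd_[dirty|mkclean] commit are, in essence, a test and a clear of the dirty bit in a PMD entry. A minimal sketch follows, assuming the x86 dirty flag at bit 6 and using a stand-in entry type; the names are hypothetical and the kernel's real helpers operate on its own pmd_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86 page-table entries keep the "dirty" flag in bit 6. */
#define SKETCH_PAGE_DIRTY (1ULL << 6)

/* Stand-in for a PMD entry; only the raw value matters for this sketch. */
typedef struct { uint64_t val; } sketch_pmd_t;

/* Has the huge page been written to since the bit was last cleared? */
static bool sketch_pmd_dirty(sketch_pmd_t pmd)
{
    return (pmd.val & SKETCH_PAGE_DIRTY) != 0;
}

/* Return a copy of the entry with the dirty bit cleared. */
static sketch_pmd_t sketch_pmd_mkclean(sketch_pmd_t pmd)
{
    pmd.val &= ~SKETCH_PAGE_DIRTY;
    assert(!sketch_pmd_dirty(pmd)); /* postcondition of mkclean */
    return pmd;
}
```

In the MADV_FREE use case described by the commit, an entry would be made clean when the range is marked free, and a later dirty entry would indicate that the contents were overwritten and must not be discarded.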
    • K
      x86, thp: remove infrastructure for handling splitting PMDs · 1f19617d
      Kirill A. Shutemov authored
      With the new refcounting we don't need to mark PMDs as splitting.  Let's
      drop the code that handles this.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f19617d
    • K
      x86/PCI: Add driver for Intel Volume Management Device (VMD) · 185a383a
      Keith Busch authored
      The Intel Volume Management Device (VMD) is a Root Complex Integrated
      Endpoint that acts as a host bridge to a secondary PCIe domain.  BIOS can
      reassign one or more Root Ports to appear within a VMD domain instead of
      the primary domain.  The immediate benefit is that additional PCIe domains
      allow more than 256 buses in a system by letting bus numbers be reused
      across different domains.
      
      VMD domains do not define ACPI _SEG, so to avoid domain clashing with host
      bridges defining this segment, VMD domains start at 0x10000, which is
      greater than the highest possible 16-bit ACPI defined _SEG.
      
      This driver enumerates and enables the domain using the root bus
      configuration interface provided by the PCI subsystem.  The driver provides
      configuration space accessor functions (pci_ops), bus and memory resources,
      an MSI IRQ domain with irq_chip implementation, and DMA operations
      necessary to use devices through the VMD endpoint's interface.
      
      VMD routes I/O as follows:
      
         1) Configuration Space: BAR 0 ("CFGBAR") of VMD provides the base
         address and size for configuration space register access to VMD-owned
         root ports.  It works similarly to MMCONFIG for extended configuration
         space.  Bus numbering is independent and does not conflict with the
         primary domain.
      
         2) MMIO Space: BARs 2 and 4 ("MEMBAR1" and "MEMBAR2") of VMD provide the
         base address, size, and type for MMIO register access.  These addresses
         are not translated by VMD hardware; they are simply reservations to be
         distributed to root ports' memory base/limit registers and subdivided
         among devices downstream.
      
         3) DMA: To interact appropriately with an IOMMU, the source ID of DMA
         read and write requests is translated to the bus-device-function of
         the VMD endpoint.  Otherwise, DMA operates normally without
         VMD-specific address translation.
      
         4) Interrupts: Part of VMD's BAR 4 is reserved for VMD's MSI-X Table and
         PBA.  MSIs from VMD domain devices and ports are remapped to appear as
         if they were issued using one of VMD's MSI-X table entries.  Each MSI
         and MSI-X address of VMD-owned devices and ports has a special format
         where the address refers to specific entries in the VMD's MSI-X table.
         As with DMA, the interrupt source ID is translated to VMD's
         bus-device-function.
      
         The driver provides its own MSI and MSI-X configuration functions
         specific to how MSI messages are used within the VMD domain, and
         provides an irq_chip for independent IRQ allocation to relay interrupts
         from VMD's interrupt handler to the appropriate device driver's handler.
      
         5) Errors: PCIe error messages are intercepted by the root ports
         normally (e.g., AER), except that with VMD, system errors (i.e.,
         firmware first) are disabled by default.  AER and hotplug interrupts
         are translated in the same way as endpoint interrupts.
      
         6) VMD does not support INTx interrupts or IO ports.  Devices or drivers
         requiring these features should either not be placed below VMD-owned
         root ports, or VMD should be disabled by BIOS for such endpoints.
      
      [bhelgaas: add VMD BAR #defines, factor out vmd_cfg_addr(), rework VMD
      resource setup, whitespace, changelog]
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de> (IRQ-related parts)
      185a383a
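Two details of the VMD description above lend themselves to a small worked example: the domain numbering that starts above the 16-bit ACPI _SEG range, and configuration access through CFGBAR that "works similarly to MMCONFIG". The sketch below assumes the standard PCIe ECAM layout (bus in bits 20 and up, device in bits 15 to 19, function in bits 12 to 14, register in the low 12 bits) for the offset into CFGBAR; whether the driver's vmd_cfg_addr() uses exactly this layout is an assumption, and all names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* From the commit text: VMD domains start at 0x10000. */
#define SKETCH_VMD_DOMAIN_START 0x10000u
/* Largest value a 16-bit ACPI _SEG can take. */
#define SKETCH_ACPI_SEG_MAX     0xFFFFu

/* Could this domain number collide with an ACPI-defined segment? */
static bool sketch_domain_in_seg_range(uint32_t domain)
{
    return domain <= SKETCH_ACPI_SEG_MAX;
}

/* ECAM-style byte offset of a config register, relative to the BAR base. */
static uint64_t sketch_cfg_offset(uint8_t bus, unsigned int dev,
                                  unsigned int fn, unsigned int reg)
{
    assert(dev < 32);   /* 5-bit device number */
    assert(fn < 8);     /* 3-bit function number */
    assert(reg < 4096); /* 4 KiB of extended config space per function */
    return ((uint64_t)bus << 20) | ((uint64_t)dev << 15) |
           ((uint64_t)fn << 12) | reg;
}
```

For example, bus 1, device 2, function 3, register 0x10 maps to offset 0x113010 under this layout, and the first VMD domain number falls outside the range a 16-bit _SEG can express.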
    • K
      x86/PCI: Allow DMA ops specific to a PCI domain · d9c3d6ff
      Keith Busch authored
      The Intel Volume Management Device (VMD) is a PCIe endpoint that acts as a
      host bridge to another PCI domain.  When devices below the VMD perform DMA,
      the VMD replaces their DMA source IDs with its own source ID.  Therefore,
      those devices require special DMA ops.
      
      Add interfaces to allow the VMD driver to set up dma_ops for the devices
      below it.
      
      [bhelgaas: remove "extern", add "static", changelog]
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      d9c3d6ff
  12. 15 Jan, 2016 1 commit
  13. 13 Jan, 2016 3 commits
  14. 12 Jan, 2016 2 commits
    • R
      lguest: Map switcher text R/O · e27d90e8
      Rusty Russell authored
      Pavel noted that lguest maps the switcher code executable and
      read-write.  This is a bad idea for any kernel text, but
      particularly for text mapped at a fixed address.
      
      Create two vmas, one for the text (PAGE_KERNEL_RX) and another
      for the stacks (PAGE_KERNEL).  Use VM_NO_GUARD to map them
      adjacent (as expected by the rest of the code).
      Reported-by: Pavel Machek <pavel@ucw.cz>
      Tested-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e27d90e8
    • A
      x86/vdso: Disallow vvar access to vclock IO for never-used vclocks · bd902c53
      Andy Lutomirski authored
      It makes me uncomfortable that even modern systems grant every
      process direct read access to the HPET.
      
      While fixing this for real without regressing anything is a mess
      (unmapping the HPET is tricky because we don't adequately track
      all the mappings), we can do almost as well by tracking which
      vclocks have ever been used and only allowing pages associated
      with used vclocks to be faulted in.
      
      This will cause rogue programs that try to peek at the HPET to
      get SIGBUS instead on most systems.
      
      We can't restrict faults to vclock pages that are associated
      with the currently selected vclock due to a race: a process
      could start to access the HPET for the first time and race
      against a switch away from the HPET as the current clocksource.
      We can't segfault the process trying to peek at the HPET in this
      case, even though the process isn't going to do anything useful
      with the data.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/e79d06295625c02512277737ab55085a498ac5d8.1451446564.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bd902c53
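The policy in the vclock commit above (allow a fault on a vclock page only if that vclock has ever been used, and never revoke that permission) can be sketched as a sticky bitmask. This is a simplified, single-threaded illustration with hypothetical names; it passes the mask explicitly and leaves out the concurrency handling that kernel code would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Record that a vclock mode has been selected at least once. Bits are only
 * ever set, never cleared, so switching to another clocksource later does
 * not revoke access for processes that race with the switch. */
static void sketch_vclock_mark_used(uint32_t *used_mask, unsigned int vclock)
{
    assert(vclock < 32);
    *used_mask |= (uint32_t)1 << vclock;
}

/* A fault on a vclock's page is allowed only if that vclock was ever used;
 * otherwise the faulting process would be refused (SIGBUS per the commit). */
static bool sketch_vclock_fault_allowed(uint32_t used_mask, unsigned int vclock)
{
    assert(vclock < 32);
    return ((used_mask >> vclock) & 1u) != 0;
}
```

Keeping the bits sticky is what avoids the race described in the commit: a process that begins reading the HPET page just as the clocksource changes still finds its vclock marked as used.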