1. 11 Dec 2019 (1 commit)
  2. 10 Dec 2019 (2 commits)
    • x86/setup: Enhance the comments · 360db4ac
      Ingo Molnar authored
      Update various comments, fix outright mistakes and meaningless descriptions.
      
      Also harmonize the style across the file, both in form and in language.
      
      Cc: linux-kernel@vger.kernel.org
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      360db4ac
    • x86/setup: Clean up the header portion of setup.c · 12609013
      Ingo Molnar authored
      In 20 years we accumulated 89 #include lines in setup.c,
      but we only need 30 of them (!) ...
      
      Get rid of the excessive ones, and while at it, sort the
      remaining ones alphabetically.
      
      Also get rid of the incomplete changelogs at the header of the file,
      and explain better what this file does.
      
      Cc: linux-kernel@vger.kernel.org
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      12609013
  3. 05 Dec 2019 (2 commits)
    • arch: sembuf.h: make uapi asm/sembuf.h self-contained · 0fb9dc28
      Masahiro Yamada authored
      Userspace cannot compile <asm/sembuf.h> due to some missing type
      definitions.  For example, building it for x86 fails as follows:
      
          CC      usr/include/asm/sembuf.h.s
        In file included from <command-line>:32:0:
        usr/include/asm/sembuf.h:17:20: error: field `sem_perm' has incomplete type
          struct ipc64_perm sem_perm; /* permissions .. see ipc.h */
                            ^~~~~~~~
        usr/include/asm/sembuf.h:24:2: error: unknown type name `__kernel_time_t'
          __kernel_time_t sem_otime; /* last semop time */
          ^~~~~~~~~~~~~~~
        usr/include/asm/sembuf.h:25:2: error: unknown type name `__kernel_ulong_t'
          __kernel_ulong_t __unused1;
          ^~~~~~~~~~~~~~~~
        usr/include/asm/sembuf.h:26:2: error: unknown type name `__kernel_time_t'
          __kernel_time_t sem_ctime; /* last change time */
          ^~~~~~~~~~~~~~~
        usr/include/asm/sembuf.h:27:2: error: unknown type name `__kernel_ulong_t'
          __kernel_ulong_t __unused2;
          ^~~~~~~~~~~~~~~~
        usr/include/asm/sembuf.h:29:2: error: unknown type name `__kernel_ulong_t'
          __kernel_ulong_t sem_nsems; /* no. of semaphores in array */
          ^~~~~~~~~~~~~~~~
        usr/include/asm/sembuf.h:30:2: error: unknown type name `__kernel_ulong_t'
          __kernel_ulong_t __unused3;
          ^~~~~~~~~~~~~~~~
        usr/include/asm/sembuf.h:31:2: error: unknown type name `__kernel_ulong_t'
          __kernel_ulong_t __unused4;
          ^~~~~~~~~~~~~~~~
      
      It is just a matter of missing include directive.
      
      Include <asm/ipcbuf.h> to make it self-contained, and add it to
      the compile-test coverage.
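
      A minimal sketch of the shape of the fix, based only on what the changelog
      states (the msgbuf.h commit below follows the same pattern):

              /* uapi <asm/sembuf.h> (sketch): add the missing include */
              #include <asm/ipcbuf.h>  /* struct ipc64_perm and, indirectly,
                                        * the __kernel_*_t typedefs */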
      
      Link: http://lkml.kernel.org/r/20191030063855.9989-3-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0fb9dc28
    • arch: msgbuf.h: make uapi asm/msgbuf.h self-contained · 9ef0e004
      Masahiro Yamada authored
      Userspace cannot compile <asm/msgbuf.h> due to some missing type
      definitions.  For example, building it for x86 fails as follows:
      
          CC      usr/include/asm/msgbuf.h.s
        In file included from usr/include/asm/msgbuf.h:6:0,
                         from <command-line>:32:
        usr/include/asm-generic/msgbuf.h:25:20: error: field `msg_perm' has incomplete type
          struct ipc64_perm msg_perm;
                            ^~~~~~~~
        usr/include/asm-generic/msgbuf.h:27:2: error: unknown type name `__kernel_time_t'
          __kernel_time_t msg_stime; /* last msgsnd time */
          ^~~~~~~~~~~~~~~
        usr/include/asm-generic/msgbuf.h:28:2: error: unknown type name `__kernel_time_t'
          __kernel_time_t msg_rtime; /* last msgrcv time */
          ^~~~~~~~~~~~~~~
        usr/include/asm-generic/msgbuf.h:29:2: error: unknown type name `__kernel_time_t'
          __kernel_time_t msg_ctime; /* last change time */
          ^~~~~~~~~~~~~~~
        usr/include/asm-generic/msgbuf.h:41:2: error: unknown type name `__kernel_pid_t'
          __kernel_pid_t msg_lspid; /* pid of last msgsnd */
          ^~~~~~~~~~~~~~
        usr/include/asm-generic/msgbuf.h:42:2: error: unknown type name `__kernel_pid_t'
          __kernel_pid_t msg_lrpid; /* last receive pid */
          ^~~~~~~~~~~~~~
      
      It is just a matter of missing include directive.
      
      Include <asm/ipcbuf.h> to make it self-contained, and add it to
      the compile-test coverage.
      
      Link: http://lkml.kernel.org/r/20191030063855.9989-2-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ef0e004
  4. 04 Dec 2019 (2 commits)
  5. 02 Dec 2019 (2 commits)
    • x86/kasan: support KASAN_VMALLOC · 0609ae01
      Daniel Axtens authored
      In the case where KASAN directly allocates memory to back vmalloc space,
      don't map the early shadow page over it.
      
      We prepopulate pgds/p4ds for the range that would otherwise be empty.
      This is required to get it synced to hardware on boot, allowing the
      lower levels of the page tables to be filled dynamically.
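
      A minimal sketch of the idea (the helper name below is an assumption for
      illustration; the actual patch may structure this differently):

              /* Only pre-populate the top-level entries covering the shadow of
               * the vmalloc area; the lower levels are filled on demand. */
              if (IS_ENABLED(CONFIG_KASAN_VMALLOC))
                      kasan_shallow_populate_pgds(
                              kasan_mem_to_shadow((void *)VMALLOC_START),
                              kasan_mem_to_shadow((void *)VMALLOC_END));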
      
      Link: http://lkml.kernel.org/r/20191031093909.9228-5-dja@axtens.net
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Acked-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0609ae01
    • x86/mm/pat: Fix off-by-one bugs in interval tree search · 91298f1a
      Ingo Molnar authored
      There's a bug in the new PAT code: the conversion of memtype_check_conflict()
      in the following commit is incorrect:
      
         8d04a5f9: ("x86/mm/pat: Convert the PAT tree to a generic interval tree")
      
              dprintk("Overlap at 0x%Lx-0x%Lx\n", match->start, match->end);
              found_type = match->type;
      
      -       node = rb_next(&match->rb);
      -       while (node) {
      -               match = rb_entry(node, struct memtype, rb);
      -
      -               if (match->start >= end) /* Checked all possible matches */
      -                       goto success;
      -
      -               if (is_node_overlap(match, start, end) &&
      -                   match->type != found_type) {
      +       match = memtype_interval_iter_next(match, start, end);
      +       while (match) {
      +               if (match->type != found_type)
                              goto failure;
      -               }
      
      -               node = rb_next(&match->rb);
      +               match = memtype_interval_iter_next(match, start, end);
              }
      
      Note how the '>= end' condition that ends the interval check got converted
      into:
      
      +       match = memtype_interval_iter_next(match, start, end);
      
      This is subtly off by one, because the interval tree interfaces require
      closed-interval parameters:
      
        include/linux/interval_tree_generic.h
      
       /*                                                                            \
        * Iterate over intervals intersecting [start;last]                           \
        *                                                                            \
        * Note that a node's interval intersects [start;last] iff:                   \
        *   Cond1: ITSTART(node) <= last                                             \
        * and                                                                        \
        *   Cond2: start <= ITLAST(node)                                             \
        */                                                                           \
      
        ...
      
                      if (ITSTART(node) <= last) {            /* Cond1 */           \
                              if (start <= ITLAST(node))      /* Cond2 */           \
                                      return node;    /* node is leftmost match */  \
      
      [start;last] is a closed interval (note the '<= last' check) - while the
      PAT 'end' parameter is 1 byte beyond the end of the range, because
      ioremap() and the other mapping APIs usually use the [start,end)
      half-open interval, derived from 'size'.
      
      This is what ioremap() does for example:
      
              /*
               * Mappings have to be page-aligned
               */
              offset = phys_addr & ~PAGE_MASK;
              phys_addr &= PHYSICAL_PAGE_MASK;
              size = PAGE_ALIGN(last_addr+1) - phys_addr;
      
              retval = reserve_memtype(phys_addr, (u64)phys_addr + size,
                                                      pcm, &new_pcm);
      
      phys_addr+size will be on a page boundary, after the last byte of the
      mapped interval.
      
      So the correct parameter to use in the interval tree searches is not
      'end' but 'end-1'.
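
      A minimal sketch of what the corrected searches look like, reusing the
      iterator calls quoted above (the actual hunk may differ in detail):

              match = memtype_interval_iter_first(&memtype_rbroot, start, end - 1);
              while (match) {
                      if (match->type != found_type)
                              goto failure;

                      match = memtype_interval_iter_next(match, start, end - 1);
              }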
      
      This could have relevance if conflicting PAT ranges are exactly adjacent,
      for example a future WC region is followed immediately by an already
      mapped UC- region - in this case memtype_check_conflict() would
      incorrectly deny the WC memtype region and downgrade the memtype to UC-.
      
      BTW., rather annoyingly this downgrading is done silently in
      memtype_check_insert():
      
      int memtype_check_insert(struct memtype *new,
                               enum page_cache_mode *ret_type)
      {
              int err = 0;
      
              err = memtype_check_conflict(new->start, new->end, new->type, ret_type);
              if (err)
                      return err;
      
              if (ret_type)
                      new->type = *ret_type;
      
              memtype_interval_insert(new, &memtype_rbroot);
              return 0;
      }
      
      So on such a conflict we'd just silently get UC- in *ret_type, and write
      it into the new region, never the wiser ...
      
      So, assuming that the patch below fixes the primary bug, the diagnostics
      side of ioremap() cache attribute downgrades would be another thing to
      fix.
      
      Anyway, I checked all the interval-tree iterations, and most of them are
      off by one - but I think the one related to memtype_check_conflict() is
      the one causing this particular performance regression.
      
      The only correct interval-tree searches were these two:
      
        arch/x86/mm/pat_interval.c:     match = memtype_interval_iter_first(&memtype_rbroot, 0, ULONG_MAX);
        arch/x86/mm/pat_interval.c:             match = memtype_interval_iter_next(match, 0, ULONG_MAX);
      
      The ULONG_MAX was hiding the off-by-one in plain sight. :-)
      
      Note that the bug was probably benign in the sense of implementing a too
      strict cache attribute conflict policy and downgrading cache attributes,
      so AFAICS the worst outcome of this bug would be a performance regression,
      not any instabilities.
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Reported-by: Kenneth R. Crudup <kenny@panix.com>
      Reported-by: Mariusz Ceier <mceier+kernel@gmail.com>
      Tested-by: Mariusz Ceier <mceier@gmail.com>
      Tested-by: Kenneth R. Crudup <kenny@panix.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20191201144947.GA4167@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      91298f1a
  6. 01 Dec 2019 (1 commit)
  7. 29 Nov 2019 (1 commit)
  8. 28 Nov 2019 (1 commit)
    • x86/fpu: Don't cache access to fpu_fpregs_owner_ctx · 59c4bd85
      Sebastian Andrzej Siewior authored
      The state/owner of the FPU is saved to fpu_fpregs_owner_ctx by pointing
      to the context that is currently loaded. It never changed during the
      lifetime of a task - it remained stable/constant.
      
      After deferred loading of the FPU registers until return to userland was
      implemented, the content of fpu_fpregs_owner_ctx may change during
      preemption and must not be cached.
      
      This went unnoticed for some time but has now become visible, in particular
      since gcc 9 is caching that load in copy_fpstate_to_sigframe() and
      reusing it in the retry loop:
      
        copy_fpstate_to_sigframe()
          load fpu_fpregs_owner_ctx and save on stack
          fpregs_lock()
          copy_fpregs_to_sigframe() /* failed */
          fpregs_unlock()
               *** PREEMPTION, another uses FPU, changes fpu_fpregs_owner_ctx ***
      
          fault_in_pages_writeable() /* succeed, retry */
      
          fpregs_lock()
      	__fpregs_load_activate()
      	  fpregs_state_valid() /* uses fpu_fpregs_owner_ctx from stack */
          copy_fpregs_to_sigframe() /* succeeds, random FPU content */
      
      This is a comparison of the assembly produced by gcc 9, without vs with this
      patch:
      
      | # arch/x86/kernel/fpu/signal.c:173:      if (!access_ok(buf, size))
      |        cmpq    %rdx, %rax      # tmp183, _4
      |        jb      .L190   #,
      |-# arch/x86/include/asm/fpu/internal.h:512:       return fpu == this_cpu_read_stable(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu;
      |-#APP
      |-# 512 "arch/x86/include/asm/fpu/internal.h" 1
      |-       movq %gs:fpu_fpregs_owner_ctx,%rax      #, pfo_ret__
      |-# 0 "" 2
      |-#NO_APP
      |-       movq    %rax, -88(%rbp) # pfo_ret__, %sfp
      …
      |-# arch/x86/include/asm/fpu/internal.h:512:       return fpu == this_cpu_read_stable(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu;
      |-       movq    -88(%rbp), %rcx # %sfp, pfo_ret__
      |-       cmpq    %rcx, -64(%rbp) # pfo_ret__, %sfp
      |+# arch/x86/include/asm/fpu/internal.h:512:       return fpu == this_cpu_read(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu;
      |+#APP
      |+# 512 "arch/x86/include/asm/fpu/internal.h" 1
      |+       movq %gs:fpu_fpregs_owner_ctx(%rip),%rax        # fpu_fpregs_owner_ctx, pfo_ret__
      |+# 0 "" 2
      |+# arch/x86/include/asm/fpu/internal.h:512:       return fpu == this_cpu_read(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu;
      |+#NO_APP
      |+       cmpq    %rax, -64(%rbp) # pfo_ret__, %sfp
      
      Use this_cpu_read() instead of this_cpu_read_stable() to avoid caching of
      fpu_fpregs_owner_ctx across preemption points.
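
      A sketch of the helper after the change, per the internal.h:512 source line
      quoted in the assembly below (surrounding code elided):

              static inline int fpregs_state_valid(struct fpu *fpu, unsigned int cpu)
              {
                      return fpu == this_cpu_read(fpu_fpregs_owner_ctx) &&
                             cpu == fpu->last_cpu;
              }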
      
      The Fixes: tag points to the commit where deferred FPU loading was
      added. Since that commit, the compiler must no longer be allowed to move the
      load of fpu_fpregs_owner_ctx somewhere else / outside of the locked
      section: a task preemption can change its value, and stale content would
      be observed.
      
       [ bp: Massage. ]
      Debugged-by: Austin Clements <austin@google.com>
      Debugged-by: David Chase <drchase@golang.org>
      Debugged-by: Ian Lance Taylor <ian@airs.com>
      Fixes: 5f409e20 ("x86/fpu: Defer FPU state load until return to userspace")
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Tested-by: Borislav Petkov <bp@suse.de>
      Cc: Aubrey Li <aubrey.li@intel.com>
      Cc: Austin Clements <austin@google.com>
      Cc: Barret Rhoden <brho@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Chase <drchase@golang.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: ian@airs.com
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Josh Bleecher Snyder <josharian@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20191128085306.hxfa2o3knqtu4wfn@linutronix.de
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=205663
      59c4bd85
  9. 27 Nov 2019 (13 commits)
    • KVM x86: Move kvm cpuid support out of svm · c1de0f25
      Peter Gonda authored
      Memory encryption support does not have module parameter dependencies
      and can be moved into the general x86 cpuid __do_cpuid_ent function.
      This change maintains the current behavior of passing through all of
      CPUID.8000001F.
      Suggested-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Peter Gonda <pgonda@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c1de0f25
    • x86/entry/32: Remove unused 'restore_all_notrace' local label · 3e1b4358
      Borislav Petkov authored
      Signed-off-by: Borislav Petkov <bp@alien8.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3e1b4358
    • perf/x86: Implement immediate enforcement of /sys/devices/cpu/rdpmc value of 0 · 405b4537
      Anthony Steinhauser authored
      According to the documentation, when you successfully write 0 to
      /sys/devices/cpu/rdpmc, the RDPMC instruction should be disabled
      unconditionally and immediately (after you close the sysfs file).
      
      Instead, in the current implementation the PMU must be reloaded, which
      only happens eventually, some time in the future. Only after that does the
      RDPMC instruction become disabled (on ring 3) on the respective core.
      
      This change makes the treatment of the 0 value as immediate and as
      unconditional as the current treatment of the 2 value; the only difference
      is that the CR4.PCE bit is naturally cleared instead of set.
      Signed-off-by: Anthony Steinhauser <asteinhauser@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Link: https://lkml.kernel.org/r/20191125054838.137615-1-asteinhauser@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      405b4537
    • crypto: arch - conditionalize crypto api in arch glue for lib code · 8394bfec
      Jason A. Donenfeld authored
      For glue code that's used by Zinc, the actual Crypto API functions might
      not necessarily exist, and don't need to exist either. Before this
      patch, there were valid build configurations that led to an unbuildable
      kernel. This fixes it to conditionalize those symbols on the existence
      of the proper config entry.
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      8394bfec
    • x86/ptrace: Document FSBASE and GSBASE ABI oddities · 56f2ab41
      Andy Lutomirski authored
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      56f2ab41
    • x86/ptrace: Remove set_segment_reg() implementations for current · 8e05f1b4
      Andy Lutomirski authored
      set_segment_reg() should be unreachable with task == current.
      Rather than confusingly trying to make it work, just explicitly
      disable this case.
      
      (regset->get is used for current in the coredump code, but the ->set
       interface is only used for ptrace, and you can't ptrace yourself.)
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8e05f1b4
    • x86/traps: die() instead of panicking on a double fault · 0337b7eb
      Andy Lutomirski authored
      A double fault has a decent chance of being recoverable by killing
      the offending thread.  Use die() so that we at least try to recover.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0337b7eb
    • x86/doublefault/32: Rewrite the x86_32 #DF handler and unify with 64-bit · 7d8d8cfd
      Andy Lutomirski authored
      The old x86_32 doublefault_fn() was crufty, and it did not
      even try to recover.  do_double_fault() is much nicer.  Rewrite the
      32-bit double fault code to sanitize CPU state and call
      do_double_fault().  This is mostly an exercise in i386 archaeology.
      
      With this patch applied, 32-bit double faults get a real stack trace,
      just like 64-bit double faults.
      
      [ mingo: merged the patch to a later kernel base. ]
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7d8d8cfd
    • x86/doublefault/32: Move #DF stack and TSS to cpu_entry_area · dc4e0021
      Andy Lutomirski authored
      There are three problems with the current layout of the doublefault
      stack and TSS.  First, the TSS is only cacheline-aligned, which is
      not enough -- if the hardware portion of the TSS (struct x86_hw_tss)
      crosses a page boundary, horrible things happen [0].  Second, the
      stack and TSS are global, so simultaneous double faults on different
      CPUs will cause massive corruption.  Third, the whole mechanism
      won't work if user CR3 is loaded, resulting in a triple fault [1].
      
      Let the doublefault stack and TSS share a page (which prevents the
      TSS from spanning a page boundary), make it percpu, and move it into
      cpu_entry_area.  Teach the stack dump code about the doublefault
      stack.
      
      [0] Real hardware will read past the end of the page onto the next
          *physical* page if a task switch happens.  Virtual machines may
          have any number of bugs, and I would consider it reasonable for
          a VM to summarily kill the guest if it tries to task-switch to
          a page-spanning TSS.
      
      [1] Real hardware triple faults.  At least some VMs seem to hang.
          I'm not sure what's going on.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      dc4e0021
    • x86/doublefault/32: Rename doublefault.c to doublefault_32.c · e99b6f46
      Andy Lutomirski authored
      doublefault.c now only contains 32-bit code.  Rename it to
      doublefault_32.c.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e99b6f46
    • x86/traps: Disentangle the 32-bit and 64-bit doublefault code · 93efbde2
      Andy Lutomirski authored
      The 64-bit doublefault handler is much nicer than the 32-bit one.
      As a first step toward unifying them, make the 64-bit handler
      self-contained.  This should have no functional effect except in the odd
      case of x86_64 with CONFIG_DOUBLEFAULT=n, in which case it will change
      the logging a bit.
      
      This also gets rid of CONFIG_DOUBLEFAULT configurability on 64-bit
      kernels.  It didn't do anything useful -- CONFIG_DOUBLEFAULT=n
      didn't actually disable doublefault handling on x86_64.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      93efbde2
    • x86/mm/32: Sync only to VMALLOC_END in vmalloc_sync_all() · 9a62d200
      Joerg Roedel authored
      The job of vmalloc_sync_all() is to help the lazy freeing of vmalloc()
      ranges: before such vmap ranges are reused we make sure that they are
      unmapped from every task's page tables.
      
      This is really easy on pagetable setups where the kernel page tables
      are shared between all tasks - this is the case on 32-bit kernels
      with SHARED_KERNEL_PMD = 1.
      
      But on !SHARED_KERNEL_PMD 32-bit kernels this involves iterating
      over the pgd_list and clearing all pmd entries in the pgds that
      are cleared in the init_mm.pgd, which is the reference pagetable
      that the vmalloc() code uses.
      
      In that context the current practice of vmalloc_sync_all() iterating
      until FIXADDR_TOP is buggy:
      
              for (address = VMALLOC_START & PMD_MASK;
                   address >= TASK_SIZE_MAX && address < FIXADDR_TOP;
                   address += PMD_SIZE) {
                      struct page *page;
      
      Because iterating up to FIXADDR_TOP will involve a lot of non-vmalloc
      address ranges:
      
      	VMALLOC -> PKMAP -> LDT -> CPU_ENTRY_AREA -> FIX_ADDR
      
      This is mostly harmless for the FIX_ADDR and CPU_ENTRY_AREA ranges
      that don't clear their pmds, but it's lethal for the LDT range,
      which relies on having different mappings in different processes,
      and 'synchronizing' them in the vmalloc sense corrupts those
      pagetable entries (clearing them).
      
      This got particularly prominent with PTI, which turns SHARED_KERNEL_PMD
      off and makes this the dominant mapping mode on 32-bit.
      
      To make the LDT work again, vmalloc_sync_all() must only iterate over
      the volatile parts of the kernel address range that are identical
      between all processes.
      
      So the correct check in vmalloc_sync_all() is "address < VMALLOC_END"
      to make sure the VMALLOC areas are synchronized and the LDT
      mapping is not falsely overwritten.
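
      A sketch of the corrected loop bounds (the body stays as quoted above):

              for (address = VMALLOC_START & PMD_MASK;
                   address >= TASK_SIZE_MAX && address < VMALLOC_END;
                   address += PMD_SIZE) {
                      struct page *page;
                      ...
              }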
      
      The CPU_ENTRY_AREA and the FIXMAP area are no longer synced either,
      but this is not really a problem since their PMDs get established
      during bootup and never change.
      
      This change fixes the ldt_gdt selftest in my setup.
      
      [ mingo: Fixed up the changelog to explain the logic and modified the
               copying to only happen up until VMALLOC_END. ]
      Reported-by: Borislav Petkov <bp@suse.de>
      Tested-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Cc: <stable@vger.kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Fixes: 7757d607: ("x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32")
      Link: https://lkml.kernel.org/r/20191126111119.GA110513@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9a62d200
    • x86/iopl: Make 'struct tss_struct' constant size again · 0bcd7762
      Ingo Molnar authored
      After the following commit:
      
        05b042a1: ("x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise")
      
      'struct cpu_entry_area' has to be Kconfig invariant, so that we always
      have a matching CPU_ENTRY_AREA_PAGES size.
      
      This commit added a CONFIG_X86_IOPL_IOPERM dependency to tss_struct:
      
        111e7b15: ("x86/ioperm: Extend IOPL config to control ioperm() as well")
      
      Which, if CONFIG_X86_IOPL_IOPERM is turned off, reduces the size of
      cpu_entry_area by two pages, triggering the assert:
      
        ./include/linux/compiler.h:391:38: error: call to ‘__compiletime_assert_202’ declared with attribute error: BUILD_BUG_ON failed: (CPU_ENTRY_AREA_PAGES+1)*PAGE_SIZE != CPU_ENTRY_AREA_MAP_SIZE
      
      Simplify the Kconfig dependencies and make cpu_entry_area constant
      size on 32-bit kernels again.
      
      Fixes: 05b042a1: ("x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise")
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0bcd7762
  10. 26 Nov 2019 (2 commits)
    • x86/insn: Add some more Intel instructions to the opcode map · af4933c1
      Adrian Hunter authored
      Add to the opcode map the following instructions:
      
      	v4fmaddps
      	v4fmaddss
      	v4fnmaddps
      	v4fnmaddss
      	vaesdec
      	vaesdeclast
      	vaesenc
      	vaesenclast
      	vcvtne2ps2bf16
      	vcvtneps2bf16
      	vdpbf16ps
      	gf2p8affineinvqb
      	vgf2p8affineinvqb
      	gf2p8affineqb
      	vgf2p8affineqb
      	gf2p8mulb
      	vgf2p8mulb
      	vp2intersectd
      	vp2intersectq
      	vp4dpwssd
      	vp4dpwssds
      	vpclmulqdq
      	vpcompressb
      	vpcompressw
      	vpdpbusd
      	vpdpbusds
      	vpdpwssd
      	vpdpwssds
      	vpexpandb
      	vpexpandw
      	vpopcntb
      	vpopcntd
      	vpopcntq
      	vpopcntw
      	vpshldd
      	vpshldq
      	vpshldvd
      	vpshldvq
      	vpshldvw
      	vpshldw
      	vpshrdd
      	vpshrdq
      	vpshrdvd
      	vpshrdvq
      	vpshrdvw
      	vpshrdw
      	vpshufbitqmb
      
      For information about the instructions, refer to the Intel SDM, May 2019
      (325462-070US), and the Intel Architecture Instruction Set Extensions, May
      2019 (319433-037).
      
      The instruction decoding can be tested using the perf tools' "x86
      instruction decoder - new instructions" test e.g.
      
        $ perf test -v "new " 2>&1 | grep -i 'v4fmaddps'
        Decoded ok: 62 f2 7f 48 9a 20                   v4fmaddps (%eax),%zmm0,%zmm4
        Decoded ok: 62 f2 7f 48 9a a4 c8 78 56 34 12    v4fmaddps 0x12345678(%eax,%ecx,8),%zmm0,%zmm4
        Decoded ok: 62 f2 7f 48 9a 20                   v4fmaddps (%rax),%zmm0,%zmm4
        Decoded ok: 67 62 f2 7f 48 9a 20                v4fmaddps (%eax),%zmm0,%zmm4
        Decoded ok: 62 f2 7f 48 9a a4 c8 78 56 34 12    v4fmaddps 0x12345678(%rax,%rcx,8),%zmm0,%zmm4
        Decoded ok: 67 62 f2 7f 48 9a a4 c8 78 56 34 12 v4fmaddps 0x12345678(%eax,%ecx,8),%zmm0,%zmm4
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Cc: x86@kernel.org
      Link: http://lore.kernel.org/lkml/20191125125044.31879-3-adrian.hunter@intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      af4933c1
    • y2038: ipc: fix x32 ABI breakage · af378468
      Arnd Bergmann authored
      The correct type on x32 is 64-bit wide, the same as for the other struct
      members around it, so use __kernel_long_t in place of the original
      __kernel_time_t here, corresponding to the rest of the structure.
      
      Fixes: caf5e32d ("y2038: ipc: remove __kernel_time_t reference from headers")
      Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      af378468
  11. 25 Nov 2019 (6 commits)
    • x86/entry/32: Fix FIXUP_ESPFIX_STACK with user CR3 · 4a13b0e3
      Andy Lutomirski authored
      UNWIND_ESPFIX_STACK needs to read the GDT, and the GDT mapping that
      can be accessed via %fs is not mapped in the user pagetables.  Use
      SGDT to find the cpu_entry_area mapping and read the espfix offset
      from that instead.
      Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4a13b0e3
    • locking/refcount: Consolidate implementations of refcount_t · fb041bb7
      Will Deacon authored
      The generic implementation of refcount_t should be good enough for
      everybody, so remove ARCH_HAS_REFCOUNT and REFCOUNT_FULL entirely,
      leaving the generic implementation enabled unconditionally.
      Signed-off-by: Will Deacon <will@kernel.org>
      Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Tested-by: Hanjun Guo <guohanjun@huawei.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Elena Reshetova <elena.reshetova@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20191121115902.2551-9-will@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fb041bb7
    • x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise · 05b042a1
      Ingo Molnar authored
      
      When two recent commits that increased the size of the 'struct cpu_entry_area'
      were merged in -tip, the 32-bit defconfig build started failing on the following
      build time assert:
      
        ./include/linux/compiler.h:391:38: error: call to ‘__compiletime_assert_189’ declared with attribute error: BUILD_BUG_ON failed: CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE
        arch/x86/mm/cpu_entry_area.c:189:2: note: in expansion of macro ‘BUILD_BUG_ON’
        In function ‘setup_cpu_entry_area_ptes’,
      
      Which corresponds to the following build time assert:
      
      	BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE);
      
      The purpose of this assert is to sanity check the fixed-value definition of
      CPU_ENTRY_AREA_PAGES in arch/x86/include/asm/pgtable_32_types.h:
      
      	#define CPU_ENTRY_AREA_PAGES    (NR_CPUS * 41)
      
      The '41' is supposed to match sizeof(struct cpu_entry_area)/PAGE_SIZE, a value
      we didn't want to define in such a low-level header, because it would cause
      dependency hell.
      
      Every time the size of cpu_entry_area is changed, we have to adjust CPU_ENTRY_AREA_PAGES
      accordingly - and this assert is checking that constraint.
      
      But the assert is both imprecise and buggy, primarily because it doesn't
      include the single readonly IDT page that is mapped at CPU_ENTRY_AREA_BASE
      (which begins at a PMD boundary).
      
      This bug was hidden by the fact that by accident CPU_ENTRY_AREA_PAGES is defined
      too large upstream (v5.4-rc8):
      
      	#define CPU_ENTRY_AREA_PAGES    (NR_CPUS * 40)
      
      Meanwhile, 'struct cpu_entry_area' is 155648 bytes, or 38 pages. So we had two extra
      pages, which hid the bug.
      
      The following commit (not yet upstream) increased the size to 40 pages:
      
        x86/iopl: ("Restrict iopl() permission scope")
      
      ... but increased CPU_ENTRY_AREA_PAGES only to 41 - i.e. shortening the gap
      to just 1 extra page.
      
      Then another not-yet-upstream commit changed the size again:
      
        880a98c3: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit")
      
      Which increased the cpu_entry_area size from 38 to 39 pages, but
      didn't change CPU_ENTRY_AREA_PAGES (kept it at 40). This worked
      fine, because we still had a page left from the accidental 'reserve'.
      
      But when these two commits were merged into the same tree, the
      combined size of cpu_entry_area grew from 38 to 40 pages, while
      CPU_ENTRY_AREA_PAGES finally caught up to 40 as well.
      
      Which is fine in terms of functionality, but the assert broke:
      
      	BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE);
      
      because CPU_ENTRY_AREA_MAP_SIZE is the total size of the area,
      which is 1 page larger due to the IDT page.
      
      To fix all this, change the assert to two precise asserts:
      
      	BUILD_BUG_ON((CPU_ENTRY_AREA_PAGES+1)*PAGE_SIZE != CPU_ENTRY_AREA_MAP_SIZE);
      	BUILD_BUG_ON(CPU_ENTRY_AREA_TOTAL_SIZE != CPU_ENTRY_AREA_MAP_SIZE);
      
      This takes the IDT page into account, and also connects the size-based
      define of CPU_ENTRY_AREA_TOTAL_SIZE with the address-subtraction based
      define of CPU_ENTRY_AREA_MAP_SIZE.
      
      Also clean up some of the names which made it rather confusing:
      
       - 'CPU_ENTRY_AREA_TOT_SIZE' wasn't actually the 'total' size of
         the cpu-entry-area, but the per-cpu array size, so rename this
         to CPU_ENTRY_AREA_ARRAY_SIZE.
      
       - Introduce CPU_ENTRY_AREA_TOTAL_SIZE that _is_ the total mapping
         size, with the IDT included.
      
       - Add comments where '+1' denotes the IDT mapping - it wasn't
         obvious and took me about 3 hours to decode...
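
      For orientation, a sketch of how the renamed constants relate; the exact
      macro forms are an assumption, but the '+ PAGE_SIZE' is the readonly IDT
      page described above:

              /* per-CPU array, plus one page for the readonly IDT mapping: */
              #define CPU_ENTRY_AREA_ARRAY_SIZE   (CPU_ENTRY_AREA_SIZE * NR_CPUS)
              #define CPU_ENTRY_AREA_TOTAL_SIZE   (CPU_ENTRY_AREA_ARRAY_SIZE + PAGE_SIZE)
              #define CPU_ENTRY_AREA_PAGES        (NR_CPUS * 39)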
      
      Finally, because this particular commit is actually applied after
      this patch:
      
        880a98c3: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit")
      
      Fix the CPU_ENTRY_AREA_PAGES value from 40 pages to the correct 39 pages.
      
      All future commits that change cpu_entry_area will have to adjust
      this value precisely.
      
      As a side note, we should probably attempt to remove CPU_ENTRY_AREA_PAGES
      and derive its value directly from the structure, without causing
      header hell - but that is an adventure for another day! :-)
      
      Fixes: 880a98c3: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit")
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      05b042a1
    • bpf: Simplify __bpf_arch_text_poke poke type handling · b553a6ec
      Daniel Borkmann authored
      Given that we have BPF_MOD_NOP_TO_{CALL,JUMP}, BPF_MOD_{CALL,JUMP}_TO_NOP
      and BPF_MOD_{CALL,JUMP}_TO_{CALL,JUMP} poke types and that we also pass in
      old_addr as well as new_addr, it's a bit redundant and unnecessarily
      complicates __bpf_arch_text_poke() itself since we can derive the same from
      the *_addr values that were passed in. Hence simplify and use
      BPF_MOD_{CALL,JUMP} as the types, which also allows the call sites to be
      cleaned up.
      
      In addition to that, __bpf_arch_text_poke() currently verifies that text
      matches expected old_insn before we invoke text_poke_bp(). Also add a check
      on new_insn and skip the rewrite if it already matches. The reason why this is
      rather useful is that it avoids any special casing in prog_array_map_poke_run()
      when the old and new prog were NULL, and has the benefit that also for this case
      we perform a check on whether the text really matches our expectations.
      Suggested-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/fcb00a2b0b288d6c73de4ef58116a821c8fe8f2f.1574555798.git.daniel@iogearbox.net
      b553a6ec
    • bpf, x86: Emit patchable direct jump as tail call · 428d5df1
      Daniel Borkmann authored
      Add initial code emission for *direct* jumps for tail call maps in
      order to avoid the retpoline overhead from a493a87f ("bpf, x64:
      implement retpoline for tail call") for situations that allow for
      it, meaning, for known constant keys at verification time which are
      used as index into the tail call map. In the case of Cilium, which makes
      heavy use of tail calls, constant keys are used in the vast majority of
      cases; only for a single occurrence do we use a dynamic key.
      
      High level outline is that if the target prog is NULL in the map, we
      emit a 5-byte nop for the fall-through case and if not, we emit a
      5-byte direct relative jmp to the target bpf_func + skipped prologue
      offset. Later during runtime, we patch these 5-byte nop/jmps upon
      tail call map update or deletions dynamically. Note that on x86-64
      the direct jmp works as we reuse the same stack frame and skip
      prologue (as opposed to some other JIT implementations).
      
      One of the issues is that the tail call map slots can change at any
      given time even during JITing. Therefore, we have two passes: i) emit
      nops for all patchable locations during main JITing phase until we
      declare prog->jited = 1 eventually. At this point the image is stable,
      not public yet and with all jmps disabled. While JITing, we collect
      additional info like poke->ip in order to remember the patch location
      for later modifications. In ii) bpf_tail_call_direct_fixup() walks
      over the progs poke_tab, locks the tail call maps poke_mutex to
      prevent from parallel updates and patches in the right locations via
      __bpf_arch_text_poke(). Note, the main bpf_arch_text_poke() cannot
      be used at this point since we're not yet exposed to kallsyms. For
      the update we use plain memcpy() since the image is not public and
      still in read-write mode. After patching, we activate that poke entry
      through poke->ip_stable. Meaning, at this point any tail call map
      updates/deletions are not going to ignore that poke entry anymore.
      Then, bpf_arch_text_poke() might still occur on the read-write image
      until we finally locked it as read-only. Both modifications on the
      given image are under text_mutex to avoid interference with each
      other when update requests come in in parallel for different tail
      call maps (current one we have locked in JIT and different one where
      poke->ip_stable was already set).
      
      Example prog:
      
        # ./bpftool p d x i 1655
         0: (b7) r3 = 0
         1: (18) r2 = map[id:526]
         3: (85) call bpf_tail_call#12
         4: (b7) r0 = 1
         5: (95) exit
      
      Before:
      
        # ./bpftool p d j i 1655
        0xffffffffc076e55c:
         0:   nopl   0x0(%rax,%rax,1)
         5:   push   %rbp
         6:   mov    %rsp,%rbp
         9:   sub    $0x200,%rsp
        10:   push   %rbx
        11:   push   %r13
        13:   push   %r14
        15:   push   %r15
        17:   pushq  $0x0                      _
        19:   xor    %edx,%edx                |_ index (arg 3)
        1b:   movabs $0xffff88d95cc82600,%rsi |_ map (arg 2)
        25:   mov    %edx,%edx                |  index >= array->map.max_entries
        27:   cmp    %edx,0x24(%rsi)          |
        2a:   jbe    0x0000000000000066       |_
        2c:   mov    -0x224(%rbp),%eax        |  tail call limit check
        32:   cmp    $0x20,%eax               |
        35:   ja     0x0000000000000066       |
        37:   add    $0x1,%eax                |
        3a:   mov    %eax,-0x224(%rbp)        |_
        40:   mov    0xd0(%rsi,%rdx,8),%rax   |_ prog = array->ptrs[index]
        48:   test   %rax,%rax                |  prog == NULL check
        4b:   je     0x0000000000000066       |_
        4d:   mov    0x30(%rax),%rax          |  goto *(prog->bpf_func + prologue_size)
        51:   add    $0x19,%rax               |
        55:   callq  0x0000000000000061       |  retpoline for indirect jump
        5a:   pause                           |
        5c:   lfence                          |
        5f:   jmp    0x000000000000005a       |
        61:   mov    %rax,(%rsp)              |
        65:   retq                            |_
        66:   mov    $0x1,%eax
        6b:   pop    %rbx
        6c:   pop    %r15
        6e:   pop    %r14
        70:   pop    %r13
        72:   pop    %rbx
        73:   leaveq
        74:   retq
      
      After; state after JIT:
      
        # ./bpftool p d j i 1655
        0xffffffffc08e8930:
         0:   nopl   0x0(%rax,%rax,1)
         5:   push   %rbp
         6:   mov    %rsp,%rbp
         9:   sub    $0x200,%rsp
        10:   push   %rbx
        11:   push   %r13
        13:   push   %r14
        15:   push   %r15
        17:   pushq  $0x0                      _
        19:   xor    %edx,%edx                |_ index (arg 3)
        1b:   movabs $0xffff9d8afd74c000,%rsi |_ map (arg 2)
        25:   mov    -0x224(%rbp),%eax        |  tail call limit check
        2b:   cmp    $0x20,%eax               |
        2e:   ja     0x000000000000003e       |
        30:   add    $0x1,%eax                |
        33:   mov    %eax,-0x224(%rbp)        |_
        39:   jmpq   0xfffffffffffd1785       |_ [direct] goto *(prog->bpf_func + prologue_size)
        3e:   mov    $0x1,%eax
        43:   pop    %rbx
        44:   pop    %r15
        46:   pop    %r14
        48:   pop    %r13
        4a:   pop    %rbx
        4b:   leaveq
        4c:   retq
      
      After; state after map update (target prog):
      
        # ./bpftool p d j i 1655
        0xffffffffc08e8930:
         0:   nopl   0x0(%rax,%rax,1)
         5:   push   %rbp
         6:   mov    %rsp,%rbp
         9:   sub    $0x200,%rsp
        10:   push   %rbx
        11:   push   %r13
        13:   push   %r14
        15:   push   %r15
        17:   pushq  $0x0
        19:   xor    %edx,%edx
        1b:   movabs $0xffff9d8afd74c000,%rsi
        25:   mov    -0x224(%rbp),%eax
        2b:   cmp    $0x20,%eax               .
        2e:   ja     0x000000000000003e       .
        30:   add    $0x1,%eax                .
        33:   mov    %eax,-0x224(%rbp)        |_
        39:   jmpq   0xffffffffffb09f55       |_ goto *(prog->bpf_func + prologue_size)
        3e:   mov    $0x1,%eax
        43:   pop    %rbx
        44:   pop    %r15
        46:   pop    %r14
        48:   pop    %r13
        4a:   pop    %rbx
        4b:   leaveq
        4c:   retq
      
      After; state after map update (no prog):
      
        # ./bpftool p d j i 1655
        0xffffffffc08e8930:
         0:   nopl   0x0(%rax,%rax,1)
         5:   push   %rbp
         6:   mov    %rsp,%rbp
         9:   sub    $0x200,%rsp
        10:   push   %rbx
        11:   push   %r13
        13:   push   %r14
        15:   push   %r15
        17:   pushq  $0x0
        19:   xor    %edx,%edx
        1b:   movabs $0xffff9d8afd74c000,%rsi
        25:   mov    -0x224(%rbp),%eax
        2b:   cmp    $0x20,%eax               .
        2e:   ja     0x000000000000003e       .
        30:   add    $0x1,%eax                .
        33:   mov    %eax,-0x224(%rbp)        |_
        39:   nopl   0x0(%rax,%rax,1)         |_ fall-through nop
        3e:   mov    $0x1,%eax
        43:   pop    %rbx
        44:   pop    %r15
        46:   pop    %r14
        48:   pop    %r13
        4a:   pop    %rbx
        4b:   leaveq
        4c:   retq
      
      A nice bonus is that this also shrinks the code emission quite a bit
      for every tail call invocation.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/6ada4c1c9d35eeb5f4ecfab94593dafa6b5c4b09.1574452833.git.daniel@iogearbox.net
      428d5df1
    • bpf, x86: Generalize and extend bpf_arch_text_poke for direct jumps · 4b3da77b
      Daniel Borkmann authored
      Add BPF_MOD_{NOP_TO_JUMP,JUMP_TO_JUMP,JUMP_TO_NOP} patching for x86
      JIT in order to be able to patch direct jumps or nop them out. We need
      this facility in order to patch tail call jumps and in later work also
      BPF static keys.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/aa4784196a8e5e985af4b30a4fe5336bce6e9643.1574452833.git.daniel@iogearbox.net
      4b3da77b
  12. 23 Nov 2019 (4 commits)
    • kvm: nVMX: Relax guest IA32_FEATURE_CONTROL constraints · 85c9aae9
      Jim Mattson authored
      Commit 37e4c997 ("KVM: VMX: validate individual bits of guest
      MSR_IA32_FEATURE_CONTROL") broke the KVM_SET_MSRS ABI by instituting
      new constraints on the data values that kvm would accept for the guest
      MSR, IA32_FEATURE_CONTROL. Perhaps these constraints should have been
      opt-in via a new KVM capability, but they were applied
      indiscriminately, breaking at least one existing hypervisor.
      
      Relax the constraints to allow either or both of
      FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX and
      FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX to be set when nVMX is
      enabled. This change is sufficient to fix the aforementioned breakage.
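
      A minimal sketch of the relaxed check; the two VMXON_ENABLED_* bits are the
      ones named above, while the LOCKED bit and the surrounding variable names
      are illustrative:

              u64 valid_bits = FEATURE_CONTROL_LOCKED;

              if (nested)
                      valid_bits |= FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX |
                                    FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX;

              return !(data & ~valid_bits);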
      
      Fixes: 37e4c997 ("KVM: VMX: validate individual bits of guest MSR_IA32_FEATURE_CONTROL")
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      85c9aae9
    • KVM: x86: Grab KVM's srcu lock when setting nested state · ad5996d9
      Sean Christopherson authored
      Acquire kvm->srcu for the duration of ->set_nested_state() to fix a bug
      where nVMX derefences ->memslots without holding ->srcu or ->slots_lock.
      
      The other half of nested migration, ->get_nested_state(), does not need
      to acquire ->srcu as it is purely a dump of internal KVM (and CPU)
      state to userspace.
      
      Detected as an RCU lockdep splat that is 100% reproducible by running
      KVM's state_test selftest with CONFIG_PROVE_LOCKING=y.  Note that the
      failing function, kvm_is_visible_gfn(), is only checking the validity of
      a gfn, it's not actually accessing guest memory (which is more or less
      unsupported during vmx_set_nested_state() due to incorrect MMU state),
      i.e. vmx_set_nested_state() itself isn't fundamentally broken.  In any
      case, setting nested state isn't a fast path so there's no reason to go
      out of our way to avoid taking ->srcu.
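
      A minimal sketch of the pattern (error handling and surrounding ioctl code
      elided; the kvm_x86_ops indirection is assumed from this era's code):

              idx = srcu_read_lock(&vcpu->kvm->srcu);
              r = kvm_x86_ops->set_nested_state(vcpu, user_kvm_nested_state, &kvm_state);
              srcu_read_unlock(&vcpu->kvm->srcu, idx);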
      
        =============================
        WARNING: suspicious RCU usage
        5.4.0-rc7+ #94 Not tainted
        -----------------------------
        include/linux/kvm_host.h:626 suspicious rcu_dereference_check() usage!
      
                     other info that might help us debug this:
      
        rcu_scheduler_active = 2, debug_locks = 1
        1 lock held by evmcs_test/10939:
         #0: ffff88826ffcb800 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0x85/0x630 [kvm]
      
        stack backtrace:
        CPU: 1 PID: 10939 Comm: evmcs_test Not tainted 5.4.0-rc7+ #94
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack+0x68/0x9b
         kvm_is_visible_gfn+0x179/0x180 [kvm]
         mmu_check_root+0x11/0x30 [kvm]
         fast_cr3_switch+0x40/0x120 [kvm]
         kvm_mmu_new_cr3+0x34/0x60 [kvm]
         nested_vmx_load_cr3+0xbd/0x1f0 [kvm_intel]
         nested_vmx_enter_non_root_mode+0xab8/0x1d60 [kvm_intel]
         vmx_set_nested_state+0x256/0x340 [kvm_intel]
         kvm_arch_vcpu_ioctl+0x491/0x11a0 [kvm]
         kvm_vcpu_ioctl+0xde/0x630 [kvm]
         do_vfs_ioctl+0xa2/0x6c0
         ksys_ioctl+0x66/0x70
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x54/0x200
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7f59a2b95f47
      
      Fixes: 8fcc4b59 ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ad5996d9
    • KVM: x86: Open code shared_msr_update() in its only caller · 05c19c2f
      Sean Christopherson authored
      Fold shared_msr_update() into its sole user to eliminate its pointless
      bounds check, its godawful printk, its misleading comment (it's called
      under a global lock), and its woefully inaccurate name.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      05c19c2f
    • KVM: x86: Remove a spurious export of a static function · 24885d1d
      Sean Christopherson authored
      A recent change inadvertently exported a static function, which results
      in modpost throwing a warning.  Fix it.
      
      Fixes: cbbaa272 ("KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      24885d1d
  13. 22 Nov 2019 (3 commits)
    • crypto: x86/chacha - only unregister algorithms if registered · b62755ae
      Eric Biggers authored
      It's not valid to call crypto_unregister_skciphers() without a prior
      call to crypto_register_skciphers().
      
      Fixes: 84e03fa3 ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      b62755ae
    • x86/hyperv: Implement hv_is_hibernation_supported() · b96f8653
      Dexuan Cui authored
      The API will be used by the hv_balloon and hv_vmbus drivers.
      
      Balloon up/down and hot-add of memory must not be active if the user
      wants the Linux VM to support hibernation, because they are incompatible
      with hibernation according to the Hyper-V team, e.g. upon suspend the
      balloon VSP doesn't save any info about the ballooned-out pages (if any);
      so, after Linux resumes, the Linux balloon VSC expects that the VSP will
      return the pages if Linux is under memory pressure, but the VSP will
      never do that, since the VSP thinks it never stole the pages from the VM.
      
      So, if the user wants Linux VM to support hibernation, Linux must forbid
      balloon up/down and hot-add, and the only functionality of the balloon VSC
      driver is reporting the VM's memory pressure to the host.
      
      Ideally, when Linux detects that the user wants it to support hibernation,
      the balloon VSC should tell the VSP that it does not support ballooning
      and hot-add. However, the current version of the VSP requires that the VSC
      support these capabilities; otherwise the capability negotiation fails and
      the VSC cannot load at all. So, with the later changes to the VSC driver,
      the Linux VM still reports to the VSP that the VSC supports these
      capabilities, but the VSC ignores the VSP's requests for balloon up/down
      and hot add, and reports an error to the VSP, when applicable. BTW, in
      the future the balloon VSP driver will allow the VSC to not support the
      capabilities of balloon up/down and hot add.
      
      The ACPI S4 state is not a must for hibernation to work, because Linux is
      able to hibernate as long as the system can shut down. However, in practice
      we decide to artificially use the presence of the virtual ACPI S4 state as
      an indicator of the user's intent to use hibernation, because the Linux VM
      must find a way to know whether the user wants to use the hibernation feature
      or not.
      
      By default, Hyper-V does not enable the virtual ACPI S4 state; on recent
      Hyper-V hosts (e.g. RS5, 19H1), the administrator is able to enable the
      state for a VM by WMI commands.
      
      Once all the vmbus and VSC patches for the hibernation feature are
      accepted, an extra patch will be submitted to forbid hibernation if the
      virtual ACPI S4 state is absent, i.e. hv_is_hibernation_supported() is
      false.
      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Reviewed-by: Michael Kelley <mikelley@microsoft.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b96f8653
    • x86: hv: Add function to allocate zeroed page for Hyper-V · fa36dcdf
      Himadri Pandya authored
      Hyper-V assumes the page size to be 4K. While this assumption holds true
      on the x86 architecture, it might not be true on ARM64. Hence define a
      Hyper-V specific function to allocate a zeroed page, which can have a
      different implementation on ARM64 to handle the conflict between
      Hyper-V's assumed page size and the actual guest page size.
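
      A minimal sketch of the x86 side, assuming the helper name implied by the
      changelog (ARM64 would supply its own implementation):

              void *hv_alloc_hyperv_zeroed_page(void)
              {
                      return (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
              }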
      Signed-off-by: Himadri Pandya <himadri18.07@gmail.com>
      Reviewed-by: Michael Kelley <mikelley@microsoft.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      fa36dcdf