1. 27 November 2019, 1 commit
    • x86/iopl: Make 'struct tss_struct' constant size again · 0bcd7762
      Committed by Ingo Molnar
      After the following commit:
      
        05b042a1: ("x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise")
      
      'struct cpu_entry_area' has to be Kconfig invariant, so that we always
      have a matching CPU_ENTRY_AREA_PAGES size.
      
      This commit added a CONFIG_X86_IOPL_IOPERM dependency to tss_struct:
      
        111e7b15: ("x86/ioperm: Extend IOPL config to control ioperm() as well")
      
      Which, if CONFIG_X86_IOPL_IOPERM is turned off, reduces the size of
      cpu_entry_area by two pages, triggering the assert:
      
        ./include/linux/compiler.h:391:38: error: call to ‘__compiletime_assert_202’ declared with attribute error: BUILD_BUG_ON failed: (CPU_ENTRY_AREA_PAGES+1)*PAGE_SIZE != CPU_ENTRY_AREA_MAP_SIZE
      
      Simplify the Kconfig dependencies and make cpu_entry_area constant
      size on 32-bit kernels again.
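
      A rough standalone sketch of the idea (illustrative names and sizes, not
      the actual arch/x86 definitions): the size-changing member simply stays
      in the struct regardless of the Kconfig option, so its size is constant:

        /* Compiles standalone; models the layout, not the real tss_struct. */
        #define IO_BITMAP_BYTES (65536 / 8)

        struct io_bitmap_sketch {
                unsigned long sequence;
                unsigned char bitmap[IO_BITMAP_BYTES + 1];
        };

        struct tss_sketch {
                unsigned long sp0;
                /*
                 * Previously this member was guarded by
                 * #ifdef CONFIG_X86_IOPL_IOPERM, so sizeof() - and with it
                 * cpu_entry_area - shrank by two pages when the option was
                 * off. Keeping it unconditional makes the layout
                 * Kconfig-invariant.
                 */
                struct io_bitmap_sketch io_bitmap;
        };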
      
      Fixes: 05b042a1: ("x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise")
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0bcd7762
  2. 25 November 2019, 7 commits
    • x86/entry/32: Fix FIXUP_ESPFIX_STACK with user CR3 · 4a13b0e3
      Committed by Andy Lutomirski
      UNWIND_ESPFIX_STACK needs to read the GDT, and the GDT mapping that
      can be accessed via %fs is not mapped in the user pagetables.  Use
      SGDT to find the cpu_entry_area mapping and read the espfix offset
      from that instead.
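
      A minimal user-space C sketch of the SGDT technique (the real change is
      in entry_32.S assembly; this only illustrates how the GDT base can be
      read without going through a %fs-based mapping):

        #include <stdint.h>
        #include <stdio.h>

        struct gdt_ptr {
                uint16_t size;        /* limit: size of the GDT minus 1 */
                uintptr_t address;    /* linear base address of the GDT */
        } __attribute__((packed));

        int main(void)
        {
                struct gdt_ptr gdtr;

                /* SGDT stores limit + base to memory; it is allowed at any
                 * privilege level unless UMIP is enabled. */
                __asm__ volatile("sgdt %0" : "=m" (gdtr));

                printf("GDT base: %#lx, limit: %u\n",
                       (unsigned long)gdtr.address, (unsigned)gdtr.size);
                return 0;
        }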
      Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4a13b0e3
    • locking/refcount: Consolidate implementations of refcount_t · fb041bb7
      Committed by Will Deacon
      The generic implementation of refcount_t should be good enough for
      everybody, so remove ARCH_HAS_REFCOUNT and REFCOUNT_FULL entirely,
      leaving the generic implementation enabled unconditionally.
      Signed-off-by: Will Deacon <will@kernel.org>
      Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Tested-by: Hanjun Guo <guohanjun@huawei.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Elena Reshetova <elena.reshetova@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20191121115902.2551-9-will@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fb041bb7
    • x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise · 05b042a1
      Committed by Ingo Molnar
      
      When two recent commits that increased the size of the 'struct cpu_entry_area'
      were merged in -tip, the 32-bit defconfig build started failing on the following
      build time assert:
      
        ./include/linux/compiler.h:391:38: error: call to ‘__compiletime_assert_189’ declared with attribute error: BUILD_BUG_ON failed: CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE
        arch/x86/mm/cpu_entry_area.c:189:2: note: in expansion of macro ‘BUILD_BUG_ON’
        In function ‘setup_cpu_entry_area_ptes’,
      
      Which corresponds to the following build time assert:
      
      	BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE);
      
      The purpose of this assert is to sanity check the fixed-value definition of
      CPU_ENTRY_AREA_PAGES in arch/x86/include/asm/pgtable_32_types.h:
      
      	#define CPU_ENTRY_AREA_PAGES    (NR_CPUS * 41)
      
      The '41' is supposed to match sizeof(struct cpu_entry_area)/PAGE_SIZE, which value
      we didn't want to define in such a low level header, because it would cause
      dependency hell.
      
      Every time the size of cpu_entry_area is changed, we have to adjust CPU_ENTRY_AREA_PAGES
      accordingly - and this assert is checking that constraint.
      
      But the assert is both imprecise and buggy, primarily because it doesn't
      include the single readonly IDT page that is mapped at CPU_ENTRY_AREA_BASE
      (which begins at a PMD boundary).
      
      This bug was hidden by the fact that by accident CPU_ENTRY_AREA_PAGES is defined
      too large upstream (v5.4-rc8):
      
      	#define CPU_ENTRY_AREA_PAGES    (NR_CPUS * 40)
      
      Meanwhile, 'struct cpu_entry_area' is 155648 bytes, or 38 pages, so we had
      two extra pages, which hid the bug.
      
      The following commit (not yet upstream) increased the size to 40 pages:
      
        x86/iopl: ("Restrict iopl() permission scope")
      
      ... but increased CPU_ENTRY_AREA_PAGES only to 41 - i.e. shortening the gap
      to just 1 extra page.
      
      Then another not-yet-upstream commit changed the size again:
      
        880a98c3: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit")
      
      Which increased the cpu_entry_area size from 38 to 39 pages, but
      didn't change CPU_ENTRY_AREA_PAGES (kept it at 40). This worked
      fine, because we still had a page left from the accidental 'reserve'.
      
      But when these two commits were merged into the same tree, the
      combined size of cpu_entry_area grew from 38 to 40 pages, while
      CPU_ENTRY_AREA_PAGES finally caught up to 40 as well.
      
      Which is fine in terms of functionality, but the assert broke:
      
      	BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE);
      
      because CPU_ENTRY_AREA_MAP_SIZE is the total size of the area,
      which is 1 page larger due to the IDT page.
      
      To fix all this, change the assert to two precise asserts:
      
      	BUILD_BUG_ON((CPU_ENTRY_AREA_PAGES+1)*PAGE_SIZE != CPU_ENTRY_AREA_MAP_SIZE);
      	BUILD_BUG_ON(CPU_ENTRY_AREA_TOTAL_SIZE != CPU_ENTRY_AREA_MAP_SIZE);
      
      This takes the IDT page into account, and also connects the size-based
      define of CPU_ENTRY_AREA_TOTAL_SIZE with the address-subtraction based
      define of CPU_ENTRY_AREA_MAP_SIZE.
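
      A simplified, standalone model of the size relationships (made-up small
      numbers and shortened names, purely to show how the two asserts tie the
      definitions together; the real MAP_SIZE is derived by address
      subtraction):

        #define PAGE_SIZE             4096UL
        #define NR_CPUS               2
        #define ENTRY_AREA_PAGES      (NR_CPUS * 39)  /* pages of the per-CPU array */
        #define ENTRY_AREA_ARRAY_SIZE (ENTRY_AREA_PAGES * PAGE_SIZE)
        #define ENTRY_AREA_TOTAL_SIZE (ENTRY_AREA_ARRAY_SIZE + PAGE_SIZE) /* + IDT page */
        #define ENTRY_AREA_MAP_SIZE   ENTRY_AREA_TOTAL_SIZE

        _Static_assert((ENTRY_AREA_PAGES + 1) * PAGE_SIZE == ENTRY_AREA_MAP_SIZE,
                       "page count (+ IDT page) must match the mapped size");
        _Static_assert(ENTRY_AREA_TOTAL_SIZE == ENTRY_AREA_MAP_SIZE,
                       "size-based and address-based definitions must agree");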
      
      Also clean up some of the names which made it rather confusing:
      
       - 'CPU_ENTRY_AREA_TOT_SIZE' wasn't actually the 'total' size of
         the cpu-entry-area, but the per-cpu array size, so rename this
         to CPU_ENTRY_AREA_ARRAY_SIZE.
      
       - Introduce CPU_ENTRY_AREA_TOTAL_SIZE that _is_ the total mapping
         size, with the IDT included.
      
       - Add comments where '+1' denotes the IDT mapping - it wasn't
         obvious and took me about 3 hours to decode...
      
      Finally, because this particular commit is actually applied after
      this patch:
      
        880a98c3: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit")
      
      Fix the CPU_ENTRY_AREA_PAGES value from 40 pages to the correct 39 pages.
      
      All future commits that change cpu_entry_area will have to adjust
      this value precisely.
      
      As a side note, we should probably attempt to remove CPU_ENTRY_AREA_PAGES
      and derive its value directly from the structure, without causing
      header hell - but that is an adventure for another day! :-)
      
      Fixes: 880a98c3: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit")
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      05b042a1
    • bpf: Simplify __bpf_arch_text_poke poke type handling · b553a6ec
      Committed by Daniel Borkmann
      Given that we have BPF_MOD_NOP_TO_{CALL,JUMP}, BPF_MOD_{CALL,JUMP}_TO_NOP
      and BPF_MOD_{CALL,JUMP}_TO_{CALL,JUMP} poke types and that we also pass in
      old_addr as well as new_addr, it's a bit redundant and unnecessarily
      complicates __bpf_arch_text_poke() itself since we can derive the same from
      the *_addr that were passed in. Hence simplify and use BPF_MOD_{CALL,JUMP}
      as types, which also allows us to clean up the call sites.
      
      In addition to that, __bpf_arch_text_poke() currently verifies that the text
      matches the expected old_insn before we invoke text_poke_bp(). Also add a
      check on new_insn and skip the rewrite if it already matches. This is rather
      useful because it avoids any special casing in prog_array_map_poke_run()
      when the old and new prog are NULL, and it has the benefit that also for
      this case we check whether the text really matches our expectations.
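
      A hedged sketch of the resulting logic (simplified names, not the exact
      arch/x86/net/bpf_jit_comp.c code): whether the old/new instruction is a
      nop or a call/jmp is derived from whether the corresponding address is
      NULL, and the rewrite is skipped when the text already matches:

        #include <string.h>

        #define X86_PATCH_SIZE 5

        /* Stand-ins for the real helpers: emit_insn() writes either a 5-byte
         * nop (addr == NULL) or a 5-byte call/jmp to addr; do_text_poke()
         * performs the live patching. */
        extern void emit_insn(unsigned char buf[X86_PATCH_SIZE],
                              void *ip, void *addr, int is_call);
        extern int do_text_poke(void *ip, const void *insn, size_t len);

        static int poke_sketch(void *ip, int is_call, void *old_addr, void *new_addr)
        {
                unsigned char old_insn[X86_PATCH_SIZE], new_insn[X86_PATCH_SIZE];

                emit_insn(old_insn, ip, old_addr, is_call); /* nop if !old_addr */
                emit_insn(new_insn, ip, new_addr, is_call); /* nop if !new_addr */

                if (memcmp(ip, old_insn, X86_PATCH_SIZE))
                        return -1;      /* text is not what we expected */
                if (!memcmp(ip, new_insn, X86_PATCH_SIZE))
                        return 0;       /* already patched, nothing to do */
                return do_text_poke(ip, new_insn, X86_PATCH_SIZE);
        }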
      Suggested-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/fcb00a2b0b288d6c73de4ef58116a821c8fe8f2f.1574555798.git.daniel@iogearbox.net
      b553a6ec
    • bpf, x86: Emit patchable direct jump as tail call · 428d5df1
      Committed by Daniel Borkmann
      Add initial code emission for *direct* jumps for tail call maps in
      order to avoid the retpoline overhead from a493a87f ("bpf, x64:
      implement retpoline for tail call") for situations that allow for
      it, meaning, for known constant keys at verification time which are
      used as index into the tail call map. In the case of Cilium, which makes
      heavy use of tail calls, constant keys are used in the vast majority of
      cases; only a single occurrence uses a dynamic key.
      
      High level outline is that if the target prog is NULL in the map, we
      emit a 5-byte nop for the fall-through case and if not, we emit a
      5-byte direct relative jmp to the target bpf_func + skipped prologue
      offset. Later during runtime, we patch these 5-byte nop/jmps upon
      tail call map update or deletions dynamically. Note that on x86-64
      the direct jmp works as we reuse the same stack frame and skip
      prologue (as opposed to some other JIT implementations).
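
      A small sketch of what gets emitted at the patch site (an illustrative
      helper, not the JIT's actual emit code): either the standard 5-byte nop
      for the fall-through case, or a 5-byte relative jmp whose displacement
      is computed from the end of the instruction:

        #include <stdint.h>
        #include <string.h>

        #define X86_PATCH_SIZE 5

        static const uint8_t x86_nop5[X86_PATCH_SIZE] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

        /* Emit a 5-byte nop (target == NULL) or "jmp rel32" (opcode 0xe9). */
        static void emit_tail_call_slot(uint8_t buf[X86_PATCH_SIZE],
                                        const void *ip, const void *target)
        {
                if (!target) {
                        memcpy(buf, x86_nop5, X86_PATCH_SIZE);
                        return;
                }
                buf[0] = 0xe9;
                int32_t rel = (int32_t)((intptr_t)target -
                                        ((intptr_t)ip + X86_PATCH_SIZE));
                memcpy(&buf[1], &rel, sizeof(rel));
        }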
      
      One of the issues is that the tail call map slots can change at any
      given time even during JITing. Therefore, we have two passes: i) emit
      nops for all patchable locations during main JITing phase until we
      declare prog->jited = 1 eventually. At this point the image is stable,
      not public yet and with all jmps disabled. While JITing, we collect
      additional info like poke->ip in order to remember the patch location
      for later modifications. In ii) bpf_tail_call_direct_fixup() walks
      over the prog's poke_tab, locks the tail call map's poke_mutex to
      prevent parallel updates, and patches in the right locations via
      __bpf_arch_text_poke(). Note, the main bpf_arch_text_poke() cannot
      be used at this point since we're not yet exposed to kallsyms. For
      the update we use plain memcpy() since the image is not public and
      still in read-write mode. After patching, we activate that poke entry
      through poke->ip_stable. Meaning, at this point any tail call map
      updates/deletions are not going to ignore that poke entry anymore.
      Then, bpf_arch_text_poke() might still occur on the read-write image
      until we finally lock it as read-only. Both modifications on the
      given image are under text_mutex to avoid interference with each
      other when update requests come in in parallel for different tail
      call maps (current one we have locked in JIT and different one where
      poke->ip_stable was already set).
      
      Example prog:
      
        # ./bpftool p d x i 1655
         0: (b7) r3 = 0
         1: (18) r2 = map[id:526]
         3: (85) call bpf_tail_call#12
         4: (b7) r0 = 1
         5: (95) exit
      
      Before:
      
        # ./bpftool p d j i 1655
        0xffffffffc076e55c:
         0:   nopl   0x0(%rax,%rax,1)
         5:   push   %rbp
         6:   mov    %rsp,%rbp
         9:   sub    $0x200,%rsp
        10:   push   %rbx
        11:   push   %r13
        13:   push   %r14
        15:   push   %r15
        17:   pushq  $0x0                      _
        19:   xor    %edx,%edx                |_ index (arg 3)
        1b:   movabs $0xffff88d95cc82600,%rsi |_ map (arg 2)
        25:   mov    %edx,%edx                |  index >= array->map.max_entries
        27:   cmp    %edx,0x24(%rsi)          |
        2a:   jbe    0x0000000000000066       |_
        2c:   mov    -0x224(%rbp),%eax        |  tail call limit check
        32:   cmp    $0x20,%eax               |
        35:   ja     0x0000000000000066       |
        37:   add    $0x1,%eax                |
        3a:   mov    %eax,-0x224(%rbp)        |_
        40:   mov    0xd0(%rsi,%rdx,8),%rax   |_ prog = array->ptrs[index]
        48:   test   %rax,%rax                |  prog == NULL check
        4b:   je     0x0000000000000066       |_
        4d:   mov    0x30(%rax),%rax          |  goto *(prog->bpf_func + prologue_size)
        51:   add    $0x19,%rax               |
        55:   callq  0x0000000000000061       |  retpoline for indirect jump
        5a:   pause                           |
        5c:   lfence                          |
        5f:   jmp    0x000000000000005a       |
        61:   mov    %rax,(%rsp)              |
        65:   retq                            |_
        66:   mov    $0x1,%eax
        6b:   pop    %rbx
        6c:   pop    %r15
        6e:   pop    %r14
        70:   pop    %r13
        72:   pop    %rbx
        73:   leaveq
        74:   retq
      
      After; state after JIT:
      
        # ./bpftool p d j i 1655
        0xffffffffc08e8930:
         0:   nopl   0x0(%rax,%rax,1)
         5:   push   %rbp
         6:   mov    %rsp,%rbp
         9:   sub    $0x200,%rsp
        10:   push   %rbx
        11:   push   %r13
        13:   push   %r14
        15:   push   %r15
        17:   pushq  $0x0                      _
        19:   xor    %edx,%edx                |_ index (arg 3)
        1b:   movabs $0xffff9d8afd74c000,%rsi |_ map (arg 2)
        25:   mov    -0x224(%rbp),%eax        |  tail call limit check
        2b:   cmp    $0x20,%eax               |
        2e:   ja     0x000000000000003e       |
        30:   add    $0x1,%eax                |
        33:   mov    %eax,-0x224(%rbp)        |_
        39:   jmpq   0xfffffffffffd1785       |_ [direct] goto *(prog->bpf_func + prologue_size)
        3e:   mov    $0x1,%eax
        43:   pop    %rbx
        44:   pop    %r15
        46:   pop    %r14
        48:   pop    %r13
        4a:   pop    %rbx
        4b:   leaveq
        4c:   retq
      
      After; state after map update (target prog):
      
        # ./bpftool p d j i 1655
        0xffffffffc08e8930:
         0:   nopl   0x0(%rax,%rax,1)
         5:   push   %rbp
         6:   mov    %rsp,%rbp
         9:   sub    $0x200,%rsp
        10:   push   %rbx
        11:   push   %r13
        13:   push   %r14
        15:   push   %r15
        17:   pushq  $0x0
        19:   xor    %edx,%edx
        1b:   movabs $0xffff9d8afd74c000,%rsi
        25:   mov    -0x224(%rbp),%eax
        2b:   cmp    $0x20,%eax               .
        2e:   ja     0x000000000000003e       .
        30:   add    $0x1,%eax                .
        33:   mov    %eax,-0x224(%rbp)        |_
        39:   jmpq   0xffffffffffb09f55       |_ goto *(prog->bpf_func + prologue_size)
        3e:   mov    $0x1,%eax
        43:   pop    %rbx
        44:   pop    %r15
        46:   pop    %r14
        48:   pop    %r13
        4a:   pop    %rbx
        4b:   leaveq
        4c:   retq
      
      After; state after map update (no prog):
      
        # ./bpftool p d j i 1655
        0xffffffffc08e8930:
         0:   nopl   0x0(%rax,%rax,1)
         5:   push   %rbp
         6:   mov    %rsp,%rbp
         9:   sub    $0x200,%rsp
        10:   push   %rbx
        11:   push   %r13
        13:   push   %r14
        15:   push   %r15
        17:   pushq  $0x0
        19:   xor    %edx,%edx
        1b:   movabs $0xffff9d8afd74c000,%rsi
        25:   mov    -0x224(%rbp),%eax
        2b:   cmp    $0x20,%eax               .
        2e:   ja     0x000000000000003e       .
        30:   add    $0x1,%eax                .
        33:   mov    %eax,-0x224(%rbp)        |_
        39:   nopl   0x0(%rax,%rax,1)         |_ fall-through nop
        3e:   mov    $0x1,%eax
        43:   pop    %rbx
        44:   pop    %r15
        46:   pop    %r14
        48:   pop    %r13
        4a:   pop    %rbx
        4b:   leaveq
        4c:   retq
      
      A nice bonus is that this also shrinks the code emission quite a bit
      for every tail call invocation.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/6ada4c1c9d35eeb5f4ecfab94593dafa6b5c4b09.1574452833.git.daniel@iogearbox.net
      428d5df1
    • bpf, x86: Generalize and extend bpf_arch_text_poke for direct jumps · 4b3da77b
      Committed by Daniel Borkmann
      Add BPF_MOD_{NOP_TO_JUMP,JUMP_TO_JUMP,JUMP_TO_NOP} patching for x86
      JIT in order to be able to patch direct jumps or nop them out. We need
      this facility in order to patch tail call jumps and in later work also
      BPF static keys.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/aa4784196a8e5e985af4b30a4fe5336bce6e9643.1574452833.git.daniel@iogearbox.net
      4b3da77b
    • powerpc: Add const qual to local_read() parameter · c392bccf
      Committed by Eric Dumazet
      A patch in net-next triggered a compile error on powerpc:
      
        include/linux/u64_stats_sync.h: In function 'u64_stats_read':
        include/asm-generic/local64.h:30:37: warning: passing argument 1 of 'local_read' discards 'const' qualifier from pointer target type
      
      It seems reasonable to relax the powerpc local_read() requirements.
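
      Roughly, the fix is just a const qualifier on the parameter (a
      simplified sketch, not the exact arch/powerpc/include/asm/local.h code),
      so that callers such as u64_stats_read(), which only hold a pointer to
      const data, can call it without a warning:

        typedef struct {
                long v;
        } local_t;

        /* Reading never modifies *l, so accept a pointer to const. */
        static inline long local_read(const local_t *l)
        {
                return *(const volatile long *)&l->v;
        }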
      
      Fixes: 316580b6 ("u64_stats: provide u64_stats_t type")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: kbuild test robot <lkp@intel.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Tested-by: Stephen Rothwell <sfr@canb.auug.org.au> # build only
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      c392bccf
  3. 24 November 2019, 2 commits
    • MIPS: SGI-IP27: Enable ethernet phy on second Origin 200 module · a8d0f11e
      Committed by Thomas Bogendoerfer
      PROM only enables the ethernet PHY on the first Origin 200 module, so we
      must do it ourselves for the second module.
      Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
      Signed-off-by: Paul Burton <paulburton@kernel.org>
      Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Lee Jones <lee.jones@linaro.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mips@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Cc: linux-rtc@vger.kernel.org
      Cc: linux-serial@vger.kernel.org
      a8d0f11e
    • MIPS: PCI: Fix fake subdevice ID for IOC3 · 29b261ff
      Committed by Thomas Bogendoerfer
      The generation of the fake subdevice ID had the vendor and device IDs swapped.
      Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
      Signed-off-by: Paul Burton <paulburton@kernel.org>
      Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Lee Jones <lee.jones@linaro.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mips@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Cc: linux-rtc@vger.kernel.org
      Cc: linux-serial@vger.kernel.org
      29b261ff
  4. 23 November 2019, 10 commits
    • kvm: nVMX: Relax guest IA32_FEATURE_CONTROL constraints · 85c9aae9
      Committed by Jim Mattson
      Commit 37e4c997 ("KVM: VMX: validate individual bits of guest
      MSR_IA32_FEATURE_CONTROL") broke the KVM_SET_MSRS ABI by instituting
      new constraints on the data values that kvm would accept for the guest
      MSR, IA32_FEATURE_CONTROL. Perhaps these constraints should have been
      opt-in via a new KVM capability, but they were applied
      indiscriminately, breaking at least one existing hypervisor.
      
      Relax the constraints to allow either or both of
      FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX and
      FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX to be set when nVMX is
      enabled. This change is sufficient to fix the aforementioned breakage.
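
      A hedged sketch of the kind of validity check this implies (simplified
      and with illustrative constant names, not the exact vmx.c code):

        #include <stdbool.h>
        #include <stdint.h>

        #define FC_LOCKED            (1ULL << 0)
        #define FC_VMXON_INSIDE_SMX  (1ULL << 1)
        #define FC_VMXON_OUTSIDE_SMX (1ULL << 2)

        /* Accept any combination of the VMXON enable bits when nVMX is
         * enabled, instead of insisting on one specific combination. */
        static bool feature_control_valid(uint64_t data, bool nested)
        {
                uint64_t valid = FC_LOCKED;

                if (nested)
                        valid |= FC_VMXON_INSIDE_SMX | FC_VMXON_OUTSIDE_SMX;

                return (data & ~valid) == 0;
        }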
      
      Fixes: 37e4c997 ("KVM: VMX: validate individual bits of guest MSR_IA32_FEATURE_CONTROL")
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      85c9aae9
    • KVM: x86: Grab KVM's srcu lock when setting nested state · ad5996d9
      Committed by Sean Christopherson
      Acquire kvm->srcu for the duration of ->set_nested_state() to fix a bug
      where nVMX dereferences ->memslots without holding ->srcu or ->slots_lock.

      The other half of nested migration, ->get_nested_state(), does not need
      to acquire ->srcu as it is purely a dump of internal KVM (and CPU)
      state to userspace.
      
      Detected as an RCU lockdep splat that is 100% reproducible by running
      KVM's state_test selftest with CONFIG_PROVE_LOCKING=y.  Note that the
      failing function, kvm_is_visible_gfn(), is only checking the validity of
      a gfn; it's not actually accessing guest memory (which is more or less
      unsupported during vmx_set_nested_state() due to incorrect MMU state),
      i.e. vmx_set_nested_state() itself isn't fundamentally broken.  In any
      case, setting nested state isn't a fast path so there's no reason to go
      out of our way to avoid taking ->srcu.
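
      The shape of the fix is a plain srcu read-side critical section around
      the callback (a hedged sketch of the KVM_SET_NESTED_STATE ioctl path;
      the local variable and argument names are illustrative, not a verbatim
      copy of kvm_arch_vcpu_ioctl()):

        idx = srcu_read_lock(&vcpu->kvm->srcu);
        r = kvm_x86_ops->set_nested_state(vcpu, user_kvm_nested_state,
                                          &kvm_state);
        srcu_read_unlock(&vcpu->kvm->srcu, idx);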
      
        =============================
        WARNING: suspicious RCU usage
        5.4.0-rc7+ #94 Not tainted
        -----------------------------
        include/linux/kvm_host.h:626 suspicious rcu_dereference_check() usage!
      
                     other info that might help us debug this:
      
        rcu_scheduler_active = 2, debug_locks = 1
        1 lock held by evmcs_test/10939:
         #0: ffff88826ffcb800 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0x85/0x630 [kvm]
      
        stack backtrace:
        CPU: 1 PID: 10939 Comm: evmcs_test Not tainted 5.4.0-rc7+ #94
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack+0x68/0x9b
         kvm_is_visible_gfn+0x179/0x180 [kvm]
         mmu_check_root+0x11/0x30 [kvm]
         fast_cr3_switch+0x40/0x120 [kvm]
         kvm_mmu_new_cr3+0x34/0x60 [kvm]
         nested_vmx_load_cr3+0xbd/0x1f0 [kvm_intel]
         nested_vmx_enter_non_root_mode+0xab8/0x1d60 [kvm_intel]
         vmx_set_nested_state+0x256/0x340 [kvm_intel]
         kvm_arch_vcpu_ioctl+0x491/0x11a0 [kvm]
         kvm_vcpu_ioctl+0xde/0x630 [kvm]
         do_vfs_ioctl+0xa2/0x6c0
         ksys_ioctl+0x66/0x70
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x54/0x200
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7f59a2b95f47
      
      Fixes: 8fcc4b59 ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ad5996d9
    • KVM: x86: Open code shared_msr_update() in its only caller · 05c19c2f
      Committed by Sean Christopherson
      Fold shared_msr_update() into its sole user to eliminate its pointless
      bounds check, its godawful printk, its misleading comment (it's called
      under a global lock), and its woefully inaccurate name.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      05c19c2f
    • KVM: x86: Remove a spurious export of a static function · 24885d1d
      Committed by Sean Christopherson
      A recent change inadvertently exported a static function, which results
      in modpost throwing a warning.  Fix it.
      
      Fixes: cbbaa272 ("KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      24885d1d
    • MIPS: Ingenic: Disable abandoned HPTLB function. · b02efeb0
      Committed by Zhou Yanjie
      The JZ4760/JZ4770/JZ4775/X1000/X1500 have an abandoned huge page TLB.
      This mode is not compatible with the MIPS standard: it causes a TLB miss
      and an infinite loop (line 21 in tlb-funcs.S) when starting the init
      process. Write 0xa9000000 to CP0 register 5, select 4, to disable this
      function and prevent getting stuck. Ingenic has confirmed that this
      operation will not adversely affect processors without the HPTLB
      function.
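
      The workaround boils down to a single coprocessor-0 write early in CPU
      setup (a hedged sketch using the generic MIPS register-access macro; the
      wrapper name is illustrative and the surrounding cpu-probe code is
      omitted):

        #include <asm/mipsregs.h>

        /* Disable the non-standard Ingenic huge-page TLB mode by writing
         * 0xa9000000 to CP0 register 5, select 4.  Harmless on cores that do
         * not implement the HPTLB function. */
        static inline void ingenic_disable_hptlb(void)
        {
                __write_32bit_c0_register($5, 4, 0xa9000000);
        }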
      Signed-off-by: Zhou Yanjie <zhouyanjie@zoho.com>
      Acked-by: Paul Cercueil <paul@crapouillou.net>
      Signed-off-by: Paul Burton <paulburton@kernel.org>
      Cc: linux-mips@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: ralf@linux-mips.org
      Cc: jhogan@kernel.org
      Cc: jiaxun.yang@flygoat.com
      Cc: gregkh@linuxfoundation.org
      Cc: malat@debian.org
      Cc: tglx@linutronix.de
      Cc: chenhc@lemote.com
      b02efeb0
    • MIPS: PCI: remember nasid changed by set interrupt affinity · 37640adb
      Committed by Thomas Bogendoerfer
      When changing interrupt affinity, remember the possibly changed nasid;
      otherwise an interrupt deactivate/activate sequence will set up the
      interrupt incorrectly.
      
      Fixes: e6308b6d ("MIPS: SGI-IP27: abstract chipset irq from bridge")
      Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
      Signed-off-by: Paul Burton <paulburton@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: linux-mips@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      37640adb
    • MIPS: SGI-IP27: Fix crash, when CPUs are disabled via nr_cpus parameter · e3d765a9
      Committed by Thomas Bogendoerfer
      If the number of CPUs is limited by the kernel command-line parameter
      nr_cpus, assignment of interrupts according to NUMA rules might not be
      possible. As a fallback, use one of the online CPUs as the interrupt
      destination.
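
      The fallback amounts to checking whether the NUMA-preferred mask yields
      a usable CPU and, if not, picking any online CPU instead (a hedged
      sketch with standard cpumask helpers; 'node_mask' is illustrative, this
      is not a verbatim copy of ip27-irq.c):

        cpu = cpumask_first_and(node_mask, cpu_online_mask);
        if (cpu >= nr_cpu_ids)
                cpu = cpumask_any(cpu_online_mask);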
      
      Fixes: 69a07a41 ("MIPS: SGI-IP27: rework HUB interrupts")
      Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
      Signed-off-by: Paul Burton <paulburton@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: linux-mips@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      e3d765a9
    • mips: add support for folded p4d page tables · 2bee1b58
      Committed by Mike Rapoport
      Implement the primitives necessary for the 4th level folding, add walks of
      the p4d level where appropriate, replace 5level-fixup.h with pgtable-nop4d.h
      and drop usage of __ARCH_USE_5LEVEL_HACK.
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Paul Burton <paulburton@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: linux-mips@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Mike Rapoport <rppt@kernel.org>
      2bee1b58
    • mips: drop __pXd_offset() macros that duplicate pXd_index() ones · 31168f03
      Committed by Mike Rapoport
      The __pXd_offset() macros are identical to the pXd_index() macros and there
      is no point in keeping both of them. All architectures define and use
      pXd_index(), so let's keep only those to make MIPS consistent with the rest
      of the kernel.
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Paul Burton <paulburton@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: linux-mips@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Mike Rapoport <rppt@kernel.org>
      31168f03
    • mips: fix build when "48 bits virtual memory" is enabled · 3ed6751b
      Committed by Mike Rapoport
      With CONFIG_MIPS_VA_BITS_48=y the build fails miserably:
      
        CC      arch/mips/kernel/asm-offsets.s
      In file included from arch/mips/include/asm/pgtable.h:644,
                       from include/linux/mm.h:99,
                       from arch/mips/kernel/asm-offsets.c:15:
      include/asm-generic/pgtable.h:16:2: error: #error CONFIG_PGTABLE_LEVELS is not consistent with __PAGETABLE_{P4D,PUD,PMD}_FOLDED
       #error CONFIG_PGTABLE_LEVELS is not consistent with __PAGETABLE_{P4D,PUD,PMD}_FOLDED
        ^~~~~
      include/asm-generic/pgtable.h:390:28: error: unknown type name 'p4d_t'; did you mean 'pmd_t'?
       static inline int p4d_same(p4d_t p4d_a, p4d_t p4d_b)
                                  ^~~~~
                                  pmd_t
      
      [ ... more such errors ... ]
      
      scripts/Makefile.build:99: recipe for target 'arch/mips/kernel/asm-offsets.s' failed
      make[2]: *** [arch/mips/kernel/asm-offsets.s] Error 1
      
      This happens because CONFIG_MIPS_VA_BITS_48 enables the 4th level of the
      page tables, but neither pgtable-nop4d.h nor 5level-fixup.h is included to
      cope with the 5th level.

      Replace the #ifdef conditions around the includes of pgtable-nop{m,u}d.h
      with explicit CONFIG_PGTABLE_LEVELS checks, and include 5level-fixup.h for
      the case when CONFIG_PGTABLE_LEVELS==4.
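
      Roughly, the include selection then becomes explicit about the
      configured number of levels (a sketch of the intent; the exact
      surrounding context in pgtable-64.h is omitted):

        #if CONFIG_PGTABLE_LEVELS == 2
        #include <asm-generic/pgtable-nopmd.h>
        #elif CONFIG_PGTABLE_LEVELS == 3
        #include <asm-generic/pgtable-nopud.h>
        #else /* CONFIG_PGTABLE_LEVELS == 4 */
        #include <asm-generic/5level-fixup.h>
        #endif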
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Paul Burton <paulburton@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: linux-mips@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Mike Rapoport <rppt@kernel.org>
      3ed6751b
  5. 22 November 2019, 13 commits
  6. 21 November 2019, 7 commits
    • KVM: x86: create mmu/ subdirectory · c50d8ae3
      Committed by Paolo Bonzini
      Preparatory work for shattering mmu.c into multiple files.  Besides making it easier
      to follow, this will also make it possible to write unit tests for various parts.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c50d8ae3
    • KVM: nVMX: Remove unnecessary TLB flushes on L1<->L2 switches when L1 use apic-access-page · 0155b2b9
      Committed by Liran Alon
      According to Intel SDM section 28.3.3.3/28.3.3.4 Guidelines for Use
      of the INVVPID/INVEPT Instruction, the hypervisor needs to execute
      INVVPID/INVEPT X in case CPU executes VMEntry with VPID/EPTP X and
      either: "Virtualize APIC accesses" VM-execution control was changed
      from 0 to 1, OR the value of apic_access_page was changed.
      
      In the nested case, the burden falls on L1, unless L0 enables EPT in
      vmcs02 but L1 enables neither EPT nor VPID in vmcs12.  For this reason
      prepare_vmcs02() and load_vmcs12_host_state() have special code to
      request a TLB flush in case L1 does not use EPT but it uses
      "virtualize APIC accesses".
      
      This special case however is not necessary. On a nested vmentry the
      physical TLB will already be flushed except if all the following apply:
      
      * L0 uses VPID
      
      * L1 uses VPID
      
      * L0 can guarantee TLB entries populated while running L1 are tagged
      differently than TLB entries populated while running L2.
      
      If the first condition is false, the processor will flush the TLB
      on vmentry to L2.  If the second or third condition are false,
      prepare_vmcs02() will request KVM_REQ_TLB_FLUSH.  However, even
      if both are true, no extra TLB flush is needed to handle the APIC
      access page:
      
      * if L1 doesn't use VPID, the second condition doesn't hold and the
      TLB will be flushed anyway.
      
      * if L1 uses VPID, it has to flush the TLB itself with INVVPID and
      section 28.3.3.3 doesn't apply to L0.
      
      * even INVEPT is not needed because, if L0 uses EPT, it uses different
      EPTP when running L2 than L1 (because guest_mode is part of mmu-role).
      In this case SDM section 28.3.3.4 doesn't apply.
      
      Similarly, examining nested_vmx_vmexit()->load_vmcs12_host_state(),
      one could note that L0 won't flush TLB only in cases where SDM sections
      28.3.3.3 and 28.3.3.4 don't apply.  In particular, if L0 uses different
      VPIDs for L1 and L2 (i.e. vmx->vpid != vmx->nested.vpid02), section
      28.3.3.3 doesn't apply.
      
      Thus, remove this flush from prepare_vmcs02() and nested_vmx_vmexit().
      
      Side-note: This patch can be viewed as removing the parts of commit
      fb6c8198 ("kvm: vmx: Flush TLB when the APIC-access address changes")
      that are not relevant anymore since commit
      1313cc2b ("kvm: mmu: Add guest_mode to kvm_mmu_page_role").
      I.e. the first commit assumes that if L0 uses EPT and L1 doesn't use EPT,
      then L0 will use the same EPTP for both L0 and L1, which indeed required
      L0 to execute INVEPT before entering the L2 guest. This assumption is
      not true anymore since guest_mode was added to the mmu-role.
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0155b2b9
    • KVM: x86: remove set but not used variable 'called' · db5a95ec
      Committed by Mao Wenan
      Fixes gcc '-Wunused-but-set-variable' warning:
      
      arch/x86/kvm/x86.c: In function 'kvm_make_scan_ioapic_request_mask':
      arch/x86/kvm/x86.c:7911:7: warning: variable 'called' set but not
      used [-Wunused-but-set-variable]

      It has not been used since commit 7ee30bc1 ("KVM: x86: deliver KVM
      IOAPIC scan request to target vCPUs").
      Signed-off-by: Mao Wenan <maowenan@huawei.com>
      Fixes: 7ee30bc1 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      db5a95ec
    • KVM: nVMX: Do not mark vmcs02->apic_access_page as dirty when unpinning · b11494bc
      Committed by Liran Alon
      vmcs->apic_access_page is simply a token that the hypervisor puts into
      the PFN of a 4KB EPTE (or PTE if using shadow-paging) that triggers
      APIC-access VMExit or APIC virtualization logic whenever a CPU running
      in VMX non-root mode reads from or writes to this PFN.

      As every write either triggers an APIC-access VMExit or is performed on
      vmcs->virtual_apic_page, the PFN pointed to by vmcs->apic_access_page
      should never actually be touched by the CPU.

      Therefore, there is no need to mark vmcs02->apic_access_page as dirty
      after unpinning it on an emulated L2->L1 VMExit or when L1 exits VMX
      operation.
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b11494bc
    • KVM: vmx: use MSR_IA32_TSX_CTRL to hard-disable TSX on guest that lack it · b07a5c53
      Committed by Paolo Bonzini
      If X86_FEATURE_RTM is disabled, the guest should not be able to access
      MSR_IA32_TSX_CTRL.  We can therefore use it in KVM to force all
      transactions from the guest to abort.
      Tested-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b07a5c53
    • KVM: vmx: implement MSR_IA32_TSX_CTRL disable RTM functionality · c11f83e0
      Committed by Paolo Bonzini
      The current guest mitigation of TAA is both too heavy and not really
      sufficient.  It is too heavy because it will cause some affected CPUs
      (those that have MDS_NO but lack TAA_NO) to fall back to VERW and
      get the corresponding slowdown.  It is not really sufficient because
      it will cause the MDS_NO bit to disappear upon microcode update, so
      that VMs started before the microcode update will not be runnable
      anymore afterwards, even with tsx=on.
      
      Instead, if tsx=on on the host, we can emulate MSR_IA32_TSX_CTRL for
      the guest and let it run without the VERW mitigation.  Even though
      MSR_IA32_TSX_CTRL is quite heavyweight, and we do not want to write
      it on every vmentry, we can use the shared MSR functionality because
      the host kernel need not protect itself from TSX-based side-channels.
      Tested-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c11f83e0
    • KVM: x86: implement MSR_IA32_TSX_CTRL effect on CPUID · edef5c36
      Committed by Paolo Bonzini
      Because KVM always emulates CPUID, the CPUID clear bit
      (bit 1) of MSR_IA32_TSX_CTRL must be emulated "manually"
      by the hypervisor when performing said emulation.
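
      In practice this means the CPUID emulation has to mask the HLE/RTM
      feature bits whenever the guest has set the CPUID-clear bit (a hedged
      fragment with illustrative variable names; the real logic lives in KVM's
      CPUID emulation path):

        #define TSX_CTRL_CPUID_CLEAR (1ULL << 1)   /* bit 1 of MSR_IA32_TSX_CTRL */
        #define CPUID_7_EBX_HLE      (1u << 4)
        #define CPUID_7_EBX_RTM      (1u << 11)

        if (leaf == 7 && subleaf == 0 &&
            (guest_tsx_ctrl & TSX_CTRL_CPUID_CLEAR))
                ebx &= ~(CPUID_7_EBX_HLE | CPUID_7_EBX_RTM);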
      
      Right now neither kvm-intel.ko nor kvm-amd.ko implement
      MSR_IA32_TSX_CTRL but this will change in the next patch.
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Tested-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      edef5c36