1. 08 4月, 2019 2 次提交
  2. 03 4月, 2019 1 次提交
  3. 30 3月, 2019 1 次提交
    • B
      x86/cpufeature: Remove __pure attribute to _static_cpu_has() · ae37a8cd
      Borislav Petkov 提交于
      __pure is used to make gcc do Common Subexpression Elimination (CSE)
      and thus save subsequent invocations of a function which does a complex
      computation (without side effects). As a simple example:
      
        bool a = _static_cpu_has(x);
        bool b = _static_cpu_has(x);
      
      gets turned into
      
        bool a = _static_cpu_has(x);
        bool b = a;
      
      However, gcc doesn't do CSE with asm()s when those get inlined - like it
      is done with _static_cpu_has() - because, for example, the t_yes/t_no
      labels are different for each inlined function body and thus cannot be
      detected as equivalent anymore for the CSE heuristic to hit.
      
      However, this all is beside the point because best it should be avoided
      to have more than one call to _static_cpu_has(X) in the same function
      due to the fact that each such call is an alternatives patch site and it
      is simply pointless.
      
      Therefore, drop the __pure attribute as it is not doing anything.
      Reported-by: NNadav Amit <nadav.amit@gmail.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: x86@kernel.org
      Link: https://lkml.kernel.org/r/20190307151036.GD26566@zn.tnic
      ae37a8cd
  4. 17 3月, 2019 3 次提交
  5. 16 3月, 2019 2 次提交
    • P
      kvm: vmx: fix formatting of a comment · 4a605bc0
      Paolo Bonzini 提交于
      Eliminate a gratuitous conflict with 5.0.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4a605bc0
    • B
      Revert "KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()" · 92da008f
      Ben Gardon 提交于
      This reverts commit 71883a62.
      
      The above commit contains an optimization to kvm_zap_gfn_range which
      uses gfn-limited TLB flushes, if enabled. If using these limited flushes,
      kvm_zap_gfn_range passes lock_flush_tlb=false to slot_handle_level_range
      which creates a race when the function unlocks to call cond_resched.
      See an example of this race below:
      
      CPU 0                   CPU 1                           CPU 3
      // zap_direct_gfn_range
      mmu_lock()
      // *ptep == pte_1
      *ptep = 0
      if (lock_flush_tlb)
              flush_tlbs()
      mmu_unlock()
                              // In invalidate range
                              // MMU notifier
                              mmu_lock()
                              if (pte != 0)
                                      *ptep = 0
                                      flush = true
                              if (flush)
                                      flush_remote_tlbs()
                              mmu_unlock()
                              return
                              // Host MM reallocates
                              // page previously
                              // backing guest memory.
                                                              // Guest accesses
                                                              // invalid page
                                                              // through pte_1
                                                              // in its TLB!!
      
      Tested: Ran all kvm-unit-tests on a Intel Haswell machine with and
      	without this patch. The patch introduced no new failures.
      Signed-off-by: NBen Gardon <bgardon@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      92da008f
  6. 15 3月, 2019 2 次提交
  7. 13 3月, 2019 3 次提交
    • M
      memblock: drop memblock_alloc_*_nopanic() variants · 26fb3dae
      Mike Rapoport 提交于
      As all the memblock allocation functions return NULL in case of error
      rather than panic(), the duplicates with _nopanic suffix can be removed.
      
      Link: http://lkml.kernel.org/r/1548057848-15136-22-git-send-email-rppt@linux.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Petr Mladek <pmladek@suse.com>		[printk]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Juergen Gross <jgross@suse.com>			[Xen]
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26fb3dae
    • M
      treewide: add checks for the return value of memblock_alloc*() · 8a7f97b9
      Mike Rapoport 提交于
      Add check for the return value of memblock_alloc*() functions and call
      panic() in case of error.  The panic message repeats the one used by
      panicing memblock allocators with adjustment of parameters to include
      only relevant ones.
      
      The replacement was mostly automated with semantic patches like the one
      below with manual massaging of format strings.
      
        @@
        expression ptr, size, align;
        @@
        ptr = memblock_alloc(size, align);
        + if (!ptr)
        + 	panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__, size, align);
      
      [anders.roxell@linaro.org: use '%pa' with 'phys_addr_t' type]
        Link: http://lkml.kernel.org/r/20190131161046.21886-1-anders.roxell@linaro.org
      [rppt@linux.ibm.com: fix format strings for panics after memblock_alloc]
        Link: http://lkml.kernel.org/r/1548950940-15145-1-git-send-email-rppt@linux.ibm.com
      [rppt@linux.ibm.com: don't panic if the allocation in sparse_buffer_init fails]
        Link: http://lkml.kernel.org/r/20190131074018.GD28876@rapoport-lnx
      [akpm@linux-foundation.org: fix xtensa printk warning]
      Link: http://lkml.kernel.org/r/1548057848-15136-20-git-send-email-rppt@linux.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: NAnders Roxell <anders.roxell@linaro.org>
      Reviewed-by: Guo Ren <ren_guo@c-sky.com>		[c-sky]
      Acked-by: Paul Burton <paul.burton@mips.com>		[MIPS]
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>	[s390]
      Reviewed-by: Juergen Gross <jgross@suse.com>		[Xen]
      Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Acked-by: Max Filippov <jcmvbkbc@gmail.com>		[xtensa]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a7f97b9
    • M
      memblock: drop __memblock_alloc_base() · 42b46aef
      Mike Rapoport 提交于
      The __memblock_alloc_base() function tries to allocate a memory up to
      the limit specified by its max_addr parameter.  Depending on the value
      of this parameter, the __memblock_alloc_base() can is replaced with the
      appropriate memblock_phys_alloc*() variant.
      
      Link: http://lkml.kernel.org/r/1548057848-15136-9-git-send-email-rppt@linux.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Acked-by: NRob Herring <robh@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Juergen Gross <jgross@suse.com>			[Xen]
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      42b46aef
  8. 09 3月, 2019 2 次提交
    • K
      perf/x86/intel/uncore: Fix client IMC events return huge result · 8041ffd3
      Kan Liang 提交于
      The client IMC bandwidth events currently return very large values:
      
        $ perf stat -e uncore_imc/data_reads/ -e uncore_imc/data_writes/ -I 10000 -a
      
        10.000117222 34,788.76 MiB uncore_imc/data_reads/
        10.000117222 8.26 MiB uncore_imc/data_writes/
        20.000374584 34,842.89 MiB uncore_imc/data_reads/
        20.000374584 10.45 MiB uncore_imc/data_writes/
        30.000633299 37,965.29 MiB uncore_imc/data_reads/
        30.000633299 323.62 MiB uncore_imc/data_writes/
        40.000891548 41,012.88 MiB uncore_imc/data_reads/
        40.000891548 6.98 MiB uncore_imc/data_writes/
        50.001142480 1,125,899,906,621,494.75 MiB uncore_imc/data_reads/
        50.001142480 6.97 MiB uncore_imc/data_writes/
      
      The client IMC events are freerunning counters. They still use the
      old event encoding format (0x1 for data_read and 0x2 for data write).
      The counter bit width is calculated by common code, which assume that
      the standard encoding format is used for the freerunning counters.
      Error bit width information is calculated.
      
      The patch intends to convert the old client IMC event encoding to the
      standard encoding format.
      
      Current common code uses event->attr.config which directly copy from
      user space. We should not implicitly modify it for a converted event.
      The event->hw.config is used to replace the event->attr.config in
      common code.
      
      For client IMC events, the event->attr.config is used to calculate a
      converted event with standard encoding format in the custom
      event_init(). The converted event is stored in event->hw.config.
      For other events of freerunning counters, they already use the standard
      encoding format. The same value as event->attr.config is assigned to
      event->hw.config in common event_init().
      Reported-by: NJin Yao <yao.jin@linux.intel.com>
      Tested-by: NJin Yao <yao.jin@linux.intel.com>
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: stable@kernel.org # v4.18+
      Fixes: 9aae1780 ("perf/x86/intel/uncore: Clean up client IMC uncore")
      Link: https://lkml.kernel.org/r/20190227165729.1861-1-kan.liang@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8041ffd3
    • J
      xen: fix dom0 boot on huge systems · 01bd2ac2
      Juergen Gross 提交于
      Commit f7c90c2a ("x86/xen: don't write ptes directly in 32-bit
      PV guests") introduced a regression for booting dom0 on huge systems
      with lots of RAM (in the TB range).
      
      Reason is that on those hosts the p2m list needs to be moved early in
      the boot process and this requires temporary page tables to be created.
      Said commit modified xen_set_pte_init() to use a hypercall for writing
      a PTE, but this requires the page table being in the direct mapped
      area, which is not the case for the temporary page tables used in
      xen_relocate_p2m().
      
      As the page tables are completely written before being linked to the
      actual address space instead of set_pte() a plain write to memory can
      be used in xen_relocate_p2m().
      
      Fixes: f7c90c2a ("x86/xen: don't write ptes directly in 32-bit PV guests")
      Cc: stable@vger.kernel.org
      Signed-off-by: NJuergen Gross <jgross@suse.com>
      Reviewed-by: NJan Beulich <jbeulich@suse.com>
      Signed-off-by: NJuergen Gross <jgross@suse.com>
      01bd2ac2
  9. 08 3月, 2019 2 次提交
  10. 07 3月, 2019 7 次提交
    • K
      x86/hyperv: Fix kernel panic when kexec on HyperV · 179fb36a
      Kairui Song 提交于
      After commit 68bb7bfb ("X86/Hyper-V: Enable IPI enlightenments"),
      kexec fails with a kernel panic:
      
      kexec_core: Starting new kernel
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v3.0 03/02/2018
      RIP: 0010:0xffffc9000001d000
      
      Call Trace:
       ? __send_ipi_mask+0x1c6/0x2d0
       ? hv_send_ipi_mask_allbutself+0x6d/0xb0
       ? mp_save_irq+0x70/0x70
       ? __ioapic_read_entry+0x32/0x50
       ? ioapic_read_entry+0x39/0x50
       ? clear_IO_APIC_pin+0xb8/0x110
       ? native_stop_other_cpus+0x6e/0x170
       ? native_machine_shutdown+0x22/0x40
       ? kernel_kexec+0x136/0x156
      
      That happens if hypercall based IPIs are used because the hypercall page is
      reset very early upon kexec reboot, but kexec sends IPIs to stop CPUs,
      which invokes the hypercall and dereferences the unusable page.
      
      To fix his, reset hv_hypercall_pg to NULL before the page is reset to avoid
      any misuse, IPI sending will fall back to the non hypercall based
      method. This only happens on kexec / kdump so just setting the pointer to
      NULL is good enough.
      
      Fixes: 68bb7bfb ("X86/Hyper-V: Enable IPI enlightenments")
      Signed-off-by: NKairui Song <kasong@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: devel@linuxdriverproject.org
      Link: https://lkml.kernel.org/r/20190306111827.14131-1-kasong@redhat.com
      179fb36a
    • Q
      x86/mm: Remove unused variable 'old_pte' · 24c41220
      Qian Cai 提交于
      The commit 3a19109e ("x86/mm: Fix try_preserve_large_page() to
      handle large PAT bit") fixed try_preserve_large_page() by using the
      corresponding pud/pmd prot/pfn interfaces, but left a variable unused
      because it no longer used pte_pfn().
      
      Later, the commit 8679de09 ("x86/mm/cpa: Split, rename and clean up
      try_preserve_large_page()") renamed try_preserve_large_page() to
      __should_split_large_page(), but the unused variable remains.
      
      arch/x86/mm/pageattr.c: In function '__should_split_large_page':
      arch/x86/mm/pageattr.c:741:17: warning: variable 'old_pte' set but not
      used [-Wunused-but-set-variable]
      
      Fixes: 3a19109e ("x86/mm: Fix try_preserve_large_page() to handle large PAT bit")
      Signed-off-by: NQian Cai <cai@lca.pw>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: dave.hansen@linux.intel.com
      Cc: luto@kernel.org
      Cc: peterz@infradead.org
      Cc: toshi.kani@hpe.com
      Cc: bp@alien8.de
      Cc: hpa@zytor.com
      Link: https://lkml.kernel.org/r/20190301152924.94762-1-cai@lca.pw
      24c41220
    • Q
      x86/mm: Remove unused variable 'cpu' · 3609e31b
      Qian Cai 提交于
      The commit a2055abe ("x86/mm: Pass flush_tlb_info to
      flush_tlb_others() etc") removed the unnecessary cpu parameter from
      uv_flush_tlb_others() but left an unused variable.
      
      arch/x86/mm/tlb.c: In function 'native_flush_tlb_others':
      arch/x86/mm/tlb.c:688:16: warning: variable 'cpu' set but not used
      [-Wunused-but-set-variable]
         unsigned int cpu;
                      ^~~
      
      Fixes: a2055abe ("x86/mm: Pass flush_tlb_info to flush_tlb_others() etc")
      Signed-off-by: NQian Cai <cai@lca.pw>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NAndyt Lutomirski <luto@kernel.org>
      Cc: dave.hansen@linux.intel.com
      Cc: peterz@infradead.org
      Cc: bp@alien8.de
      Cc: hpa@zytor.com
      Link: https://lkml.kernel.org/r/20190228220155.88124-1-cai@lca.pw
      3609e31b
    • Q
      Revert "x86_64: Increase stack size for KASAN_EXTRA" · a2863b53
      Qian Cai 提交于
      This reverts commit a8e911d1.
      KASAN_EXTRA was removed via the commit 7771bdbb ("kasan: remove use
      after scope bugs detection."), so this is no longer needed.
      Signed-off-by: NQian Cai <cai@lca.pw>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: bp@alien8.de
      Cc: akpm@linux-foundation.org
      Cc: aryabinin@virtuozzo.com
      Cc: glider@google.com
      Cc: dvyukov@google.com
      Cc: hpa@zytor.com
      Link: https://lkml.kernel.org/r/20190306213806.46139-1-cai@lca.pw
      a2863b53
    • J
      x86/unwind: Add hardcoded ORC entry for NULL · ac5ceccc
      Jann Horn 提交于
      When the ORC unwinder is invoked for an oops caused by IP==0,
      it currently has no idea what to do because there is no debug information
      for the stack frame of NULL.
      
      But if RIP is NULL, it is very likely that the last successfully executed
      instruction was an indirect CALL/JMP, and it is possible to unwind out in
      the same way as for the first instruction of a normal function. Hardcode
      a corresponding ORC entry.
      
      With an artificially-added NULL call in prctl_set_seccomp(), before this
      patch, the trace is:
      
      Call Trace:
       ? __x64_sys_prctl+0x402/0x680
       ? __ia32_sys_prctl+0x6e0/0x6e0
       ? __do_page_fault+0x457/0x620
       ? do_syscall_64+0x6d/0x160
       ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      After this patch, the trace looks like this:
      
      Call Trace:
       __x64_sys_prctl+0x402/0x680
       ? __ia32_sys_prctl+0x6e0/0x6e0
       ? __do_page_fault+0x457/0x620
       do_syscall_64+0x6d/0x160
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      prctl_set_seccomp() still doesn't show up in the trace because for some
      reason, tail call optimization is only disabled in builds that use the
      frame pointer unwinder.
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: syzbot <syzbot+ca95b2b7aef9e7cbd6ab@syzkaller.appspotmail.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Michal Marek <michal.lkml@markovi.net>
      Cc: linux-kbuild@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190301031201.7416-2-jannh@google.com
      ac5ceccc
    • J
      x86/unwind: Handle NULL pointer calls better in frame unwinder · f4f34e1b
      Jann Horn 提交于
      When the frame unwinder is invoked for an oops caused by a call to NULL, it
      currently skips the parent function because BP still points to the parent's
      stack frame; the (nonexistent) current function only has the first half of
      a stack frame, and BP doesn't point to it yet.
      
      Add a special case for IP==0 that calculates a fake BP from SP, then uses
      the real BP for the next frame.
      
      Note that this handles first_frame specially: Return information about the
      parent function as long as the saved IP is >=first_frame, even if the fake
      BP points below it.
      
      With an artificially-added NULL call in prctl_set_seccomp(), before this
      patch, the trace is:
      
      Call Trace:
       ? prctl_set_seccomp+0x3a/0x50
       __x64_sys_prctl+0x457/0x6f0
       ? __ia32_sys_prctl+0x750/0x750
       do_syscall_64+0x72/0x160
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      After this patch, the trace is:
      
      Call Trace:
       prctl_set_seccomp+0x3a/0x50
       __x64_sys_prctl+0x457/0x6f0
       ? __ia32_sys_prctl+0x750/0x750
       do_syscall_64+0x72/0x160
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: syzbot <syzbot+ca95b2b7aef9e7cbd6ab@syzkaller.appspotmail.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Michal Marek <michal.lkml@markovi.net>
      Cc: linux-kbuild@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190301031201.7416-1-jannh@google.com
      f4f34e1b
    • L
      x86/boot/KASLR: Always return a value from process_mem_region · e4a0bd03
      Louis Taylor 提交于
      When compiling with -Wreturn-type, clang warns:
      
      arch/x86/boot/compressed/kaslr.c:704:1: warning: control may reach end of
            non-void function [-Wreturn-type]
      
      This function's return statement should have been placed outside the
      ifdeffed region. Move it there.
      
      Fixes: 690eaa53 ("x86/boot/KASLR: Limit KASLR to extract the kernel in immovable memory only")
      Signed-off-by: NLouis Taylor <louis@kragniz.eu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NNathan Chancellor <natechancellor@gmail.com>
      Reviewed-by: NNick Desaulniers <ndesaulniers@google.com>
      Cc: bp@alien8.de
      Cc: hpa@zytor.com
      Cc: fanc.fnst@cn.fujitsu.com
      Cc: bhe@redhat.com
      Cc: kirill.shutemov@linux.intel.com
      Cc: jflat@chromium.org
      Link: https://lkml.kernel.org/r/20190302184929.28971-1-louis@kragniz.eu
      e4a0bd03
  11. 06 3月, 2019 12 次提交
    • P
      perf/x86/intel: Implement support for TSX Force Abort · 400816f6
      Peter Zijlstra (Intel) 提交于
      Skylake (and later) will receive a microcode update to address a TSX
      errata. This microcode will, on execution of a TSX instruction
      (speculative or not) use (clobber) PMC3. This update will also provide
      a new MSR to change this behaviour along with a CPUID bit to enumerate
      the presence of this new MSR.
      
      When the MSR gets set; the microcode will no longer use PMC3 but will
      Force Abort every TSX transaction (upon executing COMMIT).
      
      When TSX Force Abort (TFA) is allowed (default); the MSR gets set when
      PMC3 gets scheduled and cleared when, after scheduling, PMC3 is
      unused.
      
      When TFA is not allowed; clear PMC3 from all constraints such that it
      will not get used.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      400816f6
    • P
      x86: Add TSX Force Abort CPUID/MSR · 52f64909
      Peter Zijlstra (Intel) 提交于
      Skylake systems will receive a microcode update to address a TSX
      errata. This microcode will (by default) clobber PMC3 when TSX
      instructions are (speculatively or not) executed.
      
      It also provides an MSR to cause all TSX transaction to abort and
      preserve PMC3.
      
      Add the CPUID enumeration and MSR definition.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      52f64909
    • P
      perf/x86/intel: Generalize dynamic constraint creation · 11f8b2d6
      Peter Zijlstra (Intel) 提交于
      Such that we can re-use it.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      11f8b2d6
    • P
      perf/x86/intel: Make cpuc allocations consistent · d01b1f96
      Peter Zijlstra (Intel) 提交于
      The cpuc data structure allocation is different between fake and real
      cpuc's; use the same code to init/free both.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      d01b1f96
    • M
      docs/core-api/mm: fix user memory accessors formatting · bc8ff3ca
      Mike Rapoport 提交于
      The descriptions of userspace memory access functions had minor issues
      with formatting that made kernel-doc unable to properly detect the
      function/macro names and the return value sections:
      
      ./arch/x86/include/asm/uaccess.h:80: info: Scanning doc for
      ./arch/x86/include/asm/uaccess.h:139: info: Scanning doc for
      ./arch/x86/include/asm/uaccess.h:231: info: Scanning doc for
      ./arch/x86/include/asm/uaccess.h:505: info: Scanning doc for
      ./arch/x86/include/asm/uaccess.h:530: info: Scanning doc for
      ./arch/x86/lib/usercopy_32.c:58: info: Scanning doc for
      ./arch/x86/lib/usercopy_32.c:69: warning: No description found for return
      value of 'clear_user'
      ./arch/x86/lib/usercopy_32.c:78: info: Scanning doc for
      ./arch/x86/lib/usercopy_32.c:90: warning: No description found for return
      value of '__clear_user'
      
      Fix the formatting.
      
      Link: http://lkml.kernel.org/r/1549549644-4903-3-git-send-email-rppt@linux.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bc8ff3ca
    • A
      numa: make "nr_node_ids" unsigned int · b9726c26
      Alexey Dobriyan 提交于
      Number of NUMA nodes can't be negative.
      
      This saves a few bytes on x86_64:
      
      	add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238)
      	Function                                     old     new   delta
      	hv_synic_alloc.cold                           88     110     +22
      	prealloc_shrinker                            260     262      +2
      	bootstrap                                    249     251      +2
      	sched_init_numa                             1566    1567      +1
      	show_slab_objects                            778     777      -1
      	s_show                                      1201    1200      -1
      	kmem_cache_init                              346     345      -1
      	__alloc_workqueue_key                       1146    1145      -1
      	mem_cgroup_css_alloc                        1614    1612      -2
      	__do_sys_swapon                             4702    4699      -3
      	__list_lru_init                              655     651      -4
      	nic_probe                                   2379    2374      -5
      	store_user_store                             118     111      -7
      	red_zone_store                               106      99      -7
      	poison_store                                 106      99      -7
      	wq_numa_init                                 348     338     -10
      	__kmem_cache_empty                            75      65     -10
      	task_numa_free                               186     173     -13
      	merge_across_nodes_store                     351     336     -15
      	irq_create_affinity_masks                   1261    1246     -15
      	do_numa_crng_init                            343     321     -22
      	task_numa_fault                             4760    4737     -23
      	swapfile_init                                179     156     -23
      	hv_synic_alloc                               536     492     -44
      	apply_wqattrs_prepare                        746     695     -51
      
      Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b9726c26
    • A
      mm: update ptep_modify_prot_commit to take old pte value as arg · 04a86453
      Aneesh Kumar K.V 提交于
      Architectures like ppc64 require to do a conditional tlb flush based on
      the old and new value of pte.  Enable that by passing old pte value as
      the arg.
      
      Link: http://lkml.kernel.org/r/20190116085035.29729-3-aneesh.kumar@linux.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04a86453
    • A
      mm: update ptep_modify_prot_start/commit to take vm_area_struct as arg · 0cbe3e26
      Aneesh Kumar K.V 提交于
      Patch series "NestMMU pte upgrade workaround for mprotect", v5.
      
      We can upgrade pte access (R -> RW transition) via mprotect.  We need to
      make sure we follow the recommended pte update sequence as outlined in
      commit bd5050e3 ("powerpc/mm/radix: Change pte relax sequence to
      handle nest MMU hang") for such updates.  This patch series does that.
      
      This patch (of 5):
      
      Some architectures may want to call flush_tlb_range from these helpers.
      
      Link: http://lkml.kernel.org/r/20190116085035.29729-2-aneesh.kumar@linux.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0cbe3e26
    • A
      mm: replace all open encodings for NUMA_NO_NODE · 98fa15f3
      Anshuman Khandual 提交于
      Patch series "Replace all open encodings for NUMA_NO_NODE", v3.
      
      All these places for replacement were found by running the following
      grep patterns on the entire kernel code.  Please let me know if this
      might have missed some instances.  This might also have replaced some
      false positives.  I will appreciate suggestions, inputs and review.
      
      1. git grep "nid == -1"
      2. git grep "node == -1"
      3. git grep "nid = -1"
      4. git grep "node = -1"
      
      This patch (of 2):
      
      At present there are multiple places where invalid node number is
      encoded as -1.  Even though implicitly understood it is always better to
      have macros in there.  Replace these open encodings for an invalid node
      number with the global macro NUMA_NO_NODE.  This helps remove NUMA
      related assumptions like 'invalid node' from various places redirecting
      them to a common definition.
      
      Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.comSigned-off-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
      Acked-by: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
      Acked-by: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
      Acked-by: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      98fa15f3
    • B
      x86: Deprecate a.out support · eac61655
      Borislav Petkov 提交于
      Linux supports ELF binaries for ~25 years now.  a.out coredumping has
      bitrotten quite significantly and would need some fixing to get it into
      shape again but considering how even the toolchains cannot create a.out
      executables in its default configuration, let's deprecate a.out support
      and remove it a couple of releases later, instead.
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Acked-by: NRichard Weinberger <richard@nod.at>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: <linux-api@vger.kernel.org>
      Cc: <linux-fsdevel@vger.kernel.org>
      Cc: lkml <linux-kernel@vger.kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <x86@kernel.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eac61655
    • L
      a.out: remove core dumping support · 08300f44
      Linus Torvalds 提交于
      We're (finally) phasing out a.out support for good.  As Borislav Petkov
      points out, we've supported ELF binaries for about 25 years by now, and
      coredumping in particular has bitrotted over the years.
      
      None of the tool chains even support generating a.out binaries any more,
      and the plan is to deprecate a.out support entirely for the kernel.  But
      I want to start with just removing the core dumping code, because I can
      still imagine that somebody actually might want to support a.out as a
      simpler biinary format.
      
      Particularly if you generate some random binaries on the fly, ELF is a
      much more complicated format (admittedly ELF also does have a lot of
      toolchain support, mitigating that complexity a lot and you really
      should have moved over in the last 25 years).
      
      So it's at least somewhat possible that somebody out there has some
      workflow that still involves generating and running a.out executables.
      
      In contrast, it's very unlikely that anybody depends on debugging any
      legacy a.out core files.  But regardless, I want this phase-out to be
      done in two steps, so that we can resurrect a.out support (if needed)
      without having to resurrect the core file dumping that is almost
      certainly not needed.
      
      Jann Horn pointed to the <asm/a.out-core.h> file that my first trivial
      cut at this had missed.
      
      And Alan Cox points out that the a.out binary loader _could_ be done in
      user space if somebody wants to, but we might keep just the loader in
      the kernel if somebody really wants it, since the loader isn't that big
      and has no really odd special cases like the core dumping does.
      Acked-by: NBorislav Petkov <bp@alien8.de>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Jann Horn <jannh@google.com>
      Cc: Richard Weinberger <richard@nod.at>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08300f44
    • C
      signal: add pidfd_send_signal() syscall · 3eb39f47
      Christian Brauner 提交于
      The kill() syscall operates on process identifiers (pid). After a process
      has exited its pid can be reused by another process. If a caller sends a
      signal to a reused pid it will end up signaling the wrong process. This
      issue has often surfaced and there has been a push to address this problem [1].
      
      This patch uses file descriptors (fd) from proc/<pid> as stable handles on
      struct pid. Even if a pid is recycled the handle will not change. The fd
      can be used to send signals to the process it refers to.
      Thus, the new syscall pidfd_send_signal() is introduced to solve this
      problem. Instead of pids it operates on process fds (pidfd).
      
      /* prototype and argument /*
      long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags);
      
      /* syscall number 424 */
      The syscall number was chosen to be 424 to align with Arnd's rework in his
      y2038 to minimize merge conflicts (cf. [25]).
      
      In addition to the pidfd and signal argument it takes an additional
      siginfo_t and flags argument. If the siginfo_t argument is NULL then
      pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it
      is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
      The flags argument is added to allow for future extensions of this syscall.
      It currently needs to be passed as 0. Failing to do so will cause EINVAL.
      
      /* pidfd_send_signal() replaces multiple pid-based syscalls */
      The pidfd_send_signal() syscall currently takes on the job of
      rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a
      positive pid is passed to kill(2). It will however be possible to also
      replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.
      
      /* sending signals to threads (tid) and process groups (pgid) */
      Specifically, the pidfd_send_signal() syscall does currently not operate on
      process groups or threads. This is left for future extensions.
      In order to extend the syscall to allow sending signal to threads and
      process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and
      PIDFD_TYPE_TID) should be added. This implies that the flags argument will
      determine what is signaled and not the file descriptor itself. Put in other
      words, grouping in this api is a property of the flags argument not a
      property of the file descriptor (cf. [13]). Clarification for this has been
      requested by Eric (cf. [19]).
      When appropriate extensions through the flags argument are added then
      pidfd_send_signal() can additionally replace the part of kill(2) which
      operates on process groups as well as the tgkill(2) and
      rt_tgsigqueueinfo(2) syscalls.
      How such an extension could be implemented has been very roughly sketched
      in [14], [15], and [16]. However, this should not be taken as a commitment
      to a particular implementation. There might be better ways to do it.
      Right now this is intentionally left out to keep this patchset as simple as
      possible (cf. [4]).
      
      /* naming */
      The syscall had various names throughout iterations of this patchset:
      - procfd_signal()
      - procfd_send_signal()
      - taskfd_send_signal()
      In the last round of reviews it was pointed out that given that if the
      flags argument decides the scope of the signal instead of different types
      of fds it might make sense to either settle for "procfd_" or "pidfd_" as
      prefix. The community was willing to accept either (cf. [17] and [18]).
      Given that one developer expressed strong preference for the "pidfd_"
      prefix (cf. [13]) and with other developers less opinionated about the name
      we should settle for "pidfd_" to avoid further bikeshedding.
      
      The  "_send_signal" suffix was chosen to reflect the fact that the syscall
      takes on the job of multiple syscalls. It is therefore intentional that the
      name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
      fomer because it might imply that pidfd_send_signal() is a replacement for
      kill(2), and not the latter because it is a hassle to remember the correct
      spelling - especially for non-native speakers - and because it is not
      descriptive enough of what the syscall actually does. The name
      "pidfd_send_signal" makes it very clear that its job is to send signals.
      
      /* zombies */
      Zombies can be signaled just as any other process. No special error will be
      reported since a zombie state is an unreliable state (cf. [3]). However,
      this can be added as an extension through the @flags argument if the need
      ever arises.
      
      /* cross-namespace signals */
      The patch currently enforces that the signaler and signalee either are in
      the same pid namespace or that the signaler's pid namespace is an ancestor
      of the signalee's pid namespace. This is done for the sake of simplicity
      and because it is unclear to what values certain members of struct
      siginfo_t would need to be set to (cf. [5], [6]).
      
      /* compat syscalls */
      It became clear that we would like to avoid adding compat syscalls
      (cf. [7]).  The compat syscall handling is now done in kernel/signal.c
      itself by adding __copy_siginfo_from_user_generic() which lets us avoid
      compat syscalls (cf. [8]). It should be noted that the addition of
      __copy_siginfo_from_user_any() is caused by a bug in the original
      implementation of rt_sigqueueinfo(2) (cf. 12).
      With upcoming rework for syscall handling things might improve
      significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
      any additional callers.
      
      /* testing */
      This patch was tested on x64 and x86.
      
      /* userspace usage */
      An asciinema recording for the basic functionality can be found under [9].
      With this patch a process can be killed via:
      
       #define _GNU_SOURCE
       #include <errno.h>
       #include <fcntl.h>
       #include <signal.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/stat.h>
       #include <sys/syscall.h>
       #include <sys/types.h>
       #include <unistd.h>
      
       static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
                                               unsigned int flags)
       {
       #ifdef __NR_pidfd_send_signal
               return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
       #else
               return -ENOSYS;
       #endif
       }
      
       int main(int argc, char *argv[])
       {
               int fd, ret, saved_errno, sig;
      
               if (argc < 3)
                       exit(EXIT_FAILURE);
      
               fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
               if (fd < 0) {
                       printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
                       exit(EXIT_FAILURE);
               }
      
               sig = atoi(argv[2]);
      
               printf("Sending signal %d to process %s\n", sig, argv[1]);
               ret = do_pidfd_send_signal(fd, sig, NULL, 0);
      
               saved_errno = errno;
               close(fd);
               errno = saved_errno;
      
               if (ret < 0) {
                       printf("%s - Failed to send signal %d to process %s\n",
                              strerror(errno), sig, argv[1]);
                       exit(EXIT_FAILURE);
               }
      
               exit(EXIT_SUCCESS);
       }
      
      /* Q&A
       * Given that it seems the same questions get asked again by people who are
       * late to the party it makes sense to add a Q&A section to the commit
       * message so it's hopefully easier to avoid duplicate threads.
       *
       * For the sake of progress please consider these arguments settled unless
       * there is a new point that desperately needs to be addressed. Please make
       * sure to check the links to the threads in this commit message whether
       * this has not already been covered.
       */
      Q-01: (Florian Weimer [20], Andrew Morton [21])
            What happens when the target process has exited?
      A-01: Sending the signal will fail with ESRCH (cf. [22]).
      
      Q-02:  (Andrew Morton [21])
             Is the task_struct pinned by the fd?
      A-02:  No. A reference to struct pid is kept. struct pid - as far as I
             understand - was created exactly for the reason to not require to
             pin struct task_struct (cf. [22]).
      
      Q-03: (Andrew Morton [21])
            Does the entire procfs directory remain visible? Just one entry
            within it?
      A-03: The same thing that happens right now when you hold a file descriptor
            to /proc/<pid> open (cf. [22]).
      
      Q-04: (Andrew Morton [21])
            Does the pid remain reserved?
      A-04: No. This patchset guarantees a stable handle not that pids are not
            recycled (cf. [22]).
      
      Q-05: (Andrew Morton [21])
            Do attempts to signal that fd return errors?
      A-05: See {Q,A}-01.
      
      Q-06: (Andrew Morton [22])
            Is there a cleaner way of obtaining the fd? Another syscall perhaps.
      A-06: Userspace can already trivially retrieve file descriptors from procfs
            so this is something that we will need to support anyway. Hence,
            there's no immediate need to add another syscalls just to make
            pidfd_send_signal() not dependent on the presence of procfs. However,
            adding a syscalls to get such file descriptors is planned for a
            future patchset (cf. [22]).
      
      Q-07: (Andrew Morton [21] and others)
            This fd-for-a-process sounds like a handy thing and people may well
            think up other uses for it in the future, probably unrelated to
            signals. Are the code and the interface designed to permit such
            future applications?
      A-07: Yes (cf. [22]).
      
      Q-08: (Andrew Morton [21] and others)
            Now I think about it, why a new syscall? This thing is looking
            rather like an ioctl?
      A-08: This has been extensively discussed. It was agreed that a syscall is
            preferred for a variety or reasons. Here are just a few taken from
            prior threads. Syscalls are safer than ioctl()s especially when
            signaling to fds. Processes are a core kernel concept so a syscall
            seems more appropriate. The layout of the syscall with its four
            arguments would require the addition of a custom struct for the
            ioctl() thereby causing at least the same amount or even more
            complexity for userspace than a simple syscall. The new syscall will
            replace multiple other pid-based syscalls (see description above).
            The file-descriptors-for-processes concept introduced with this
            syscall will be extended with other syscalls in the future. See also
            [22], [23] and various other threads already linked in here.
      
      Q-09: (Florian Weimer [24])
            What happens if you use the new interface with an O_PATH descriptor?
      A-09:
            pidfds opened as O_PATH fds cannot be used to send signals to a
            process (cf. [2]). Signaling processes through pidfds is the
            equivalent of writing to a file. Thus, this is not an operation that
            operates "purely at the file descriptor level" as required by the
            open(2) manpage. See also [4].
      
      /* References */
      [1]:  https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
      [2]:  https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
      [3]:  https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
      [4]:  https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
      [5]:  https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
      [6]:  https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
      [7]:  https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
      [8]:  https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
      [9]:  https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
      [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
      [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
      [13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
      [14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/
      [15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/
      [16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/
      [17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
      [18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/
      [19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/
      [20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
      [21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/
      [22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
      [23]: https://lwn.net/Articles/773459/
      [24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
      [25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/
      
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: NChristian Brauner <christian@brauner.io>
      Reviewed-by: NTycho Andersen <tycho@tycho.ws>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NSerge Hallyn <serge@hallyn.com>
      Acked-by: NAleksa Sarai <cyphar@cyphar.com>
      3eb39f47
  12. 05 3月, 2019 3 次提交
    • S
      x86/ftrace: Fix warning and considate ftrace_jmp_replace() and ftrace_call_replace() · 745cfeaa
      Steven Rostedt (VMware) 提交于
      Arnd reported the following compiler warning:
      
      arch/x86/kernel/ftrace.c:669:23: error: 'ftrace_jmp_replace' defined but not used [-Werror=unused-function]
      
      The ftrace_jmp_replace() function now only has a single user and should be
      simply moved by that user. But looking at the code, it shows that
      ftrace_jmp_replace() is similar to ftrace_call_replace() except that instead
      of using the opcode of 0xe8 it uses 0xe9. It makes more sense to consolidate
      that function into one implementation that both ftrace_jmp_replace() and
      ftrace_call_replace() use by passing in the op code separate.
      
      The structure in ftrace_code_union is also modified to replace the "e8"
      field with the more appropriate name "op".
      
      Cc: stable@vger.kernel.org
      Reported-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Link: http://lkml.kernel.org/r/20190304200748.1418790-1-arnd@arndb.de
      Fixes: d2a68c4e ("x86/ftrace: Do not call function graph from dynamic trampolines")
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      745cfeaa
    • A
      xen: remove pre-xen3 fallback handlers · b1ddd406
      Arnd Bergmann 提交于
      The legacy hypercall handlers were originally added with
      a comment explaining that "copying the argument structures in
      HYPERVISOR_event_channel_op() and HYPERVISOR_physdev_op() into the local
      variable is sufficiently safe" and only made sure to not write
      past the end of the argument structure, the checks in linux/string.h
      disagree with that, when link-time optimizations are used:
      
      In function 'memcpy',
          inlined from 'pirq_query_unmask' at drivers/xen/fallback.c:53:2,
          inlined from '__startup_pirq' at drivers/xen/events/events_base.c:529:2,
          inlined from 'restore_pirqs' at drivers/xen/events/events_base.c:1439:3,
          inlined from 'xen_irq_resume' at drivers/xen/events/events_base.c:1581:2:
      include/linux/string.h:350:3: error: call to '__read_overflow2' declared with attribute error: detected read beyond size of object passed as 2nd parameter
         __read_overflow2();
         ^
      
      Further research turned out that only Xen 3.0.2 or earlier required the
      fallback at all, while all versions in use today don't need it.
      As far as I can tell, it is not even possible to run a mainline kernel
      on those old Xen releases, at the time when they were in use, only
      a patched kernel was supported anyway.
      
      Fixes: cf47a83f ("xen/hypercall: fix hypercall fallback code for very old hypervisors")
      Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NJuergen Gross <jgross@suse.com>
      b1ddd406
    • L
      get rid of legacy 'get_ds()' function · 736706be
      Linus Torvalds 提交于
      Every in-kernel use of this function defined it to KERNEL_DS (either as
      an actual define, or as an inline function).  It's an entirely
      historical artifact, and long long long ago used to actually read the
      segment selector valueof '%ds' on x86.
      
      Which in the kernel is always KERNEL_DS.
      
      Inspired by a patch from Jann Horn that just did this for a very small
      subset of users (the ones in fs/), along with Al who suggested a script.
      I then just took it to the logical extreme and removed all the remaining
      gunk.
      
      Roughly scripted with
      
         git grep -l '(get_ds())' -- :^tools/ | xargs sed -i 's/(get_ds())/(KERNEL_DS)/'
         git grep -lw 'get_ds' -- :^tools/ | xargs sed -i '/^#define get_ds()/d'
      
      plus manual fixups to remove a few unusual usage patterns, the couple of
      inline function cases and to fix up a comment that had become stale.
      
      The 'get_ds()' function remains in an x86 kvm selftest, since in user
      space it actually does something relevant.
      Inspired-by: NJann Horn <jannh@google.com>
      Inspired-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      736706be