1. 15 May 2019, 19 commits
  2. 10 May 2019, 2 commits
  3. 08 May 2019, 6 commits
    • x86/mm/tlb: Revert "x86/mm: Align TLB invalidation info" · 7a32cbf1
      Committed by Peter Zijlstra
      commit 780e0106d468a2962b16b52fdf42898f2639e0a0 upstream.
      
      Revert the following commit:
      
        515ab7c4: ("x86/mm: Align TLB invalidation info")
      
      I found out (the hard way) that under some .config options (notably L1_CACHE_SHIFT=7)
      and compiler combinations this on-stack alignment leads to a 320 byte
      stack usage, which then triggers a KASAN stack warning elsewhere.
      
      Using 320 bytes of stack space for a 40 byte structure is ludicrous and
      clearly not right.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Nadav Amit <namit@vmware.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 515ab7c4 ("x86/mm: Align TLB invalidation info")
      Link: http://lkml.kernel.org/r/20190416080335.GM7905@worktop.programming.kicks-ass.net
      [ Minor changelog edits. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7a32cbf1
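
      A tiny user-space sketch of the effect this revert describes. The struct name, its
      fields and the 128-byte (L1_CACHE_SHIFT=7) figure are illustrative assumptions, not
      the kernel's actual flush_tlb_info:

        #include <stdio.h>

        /* Roughly 40 bytes of payload, but forced to cacheline alignment the
         * way the reverted commit aligned the on-stack variable. */
        struct flush_info {
                unsigned long start;
                unsigned long end;
                unsigned long new_tlb_gen;
                void *mm;
                unsigned int stride_shift;
        } __attribute__((aligned(128)));

        int main(void)
        {
                struct flush_info info = { 0 };

                /* An aligned type is padded to its alignment, so the on-stack
                 * object occupies at least 128 bytes instead of ~40; KASAN
                 * redzones and frame padding inflate that further. */
                printf("at %p, sizeof(struct flush_info) = %zu\n",
                       (void *)&info, sizeof(info));
                return 0;
        }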
    • x86/mm: Fix a crash with kmemleak_scan() · c48b027f
      Committed by Qian Cai
      commit 0d02113b31b2017dd349ec9df2314e798a90fa6e upstream.
      
      The first kmemleak_scan() call after boot would trigger the crash below
      because this callpath:
      
        kernel_init
          free_initmem
            mem_encrypt_free_decrypted_mem
              free_init_pages
      
      unmaps memory inside the .bss when DEBUG_PAGEALLOC=y.
      
      kmemleak_init() will register the .data/.bss sections and then
      kmemleak_scan() will scan those addresses and dereference them looking
      for pointer references. If free_init_pages() frees and unmaps pages in
      those sections, kmemleak_scan() will crash if referencing one of those
      addresses:
      
        BUG: unable to handle kernel paging request at ffffffffbd402000
        CPU: 12 PID: 325 Comm: kmemleak Not tainted 5.1.0-rc4+ #4
        RIP: 0010:scan_block
        Call Trace:
         scan_gray_list
         kmemleak_scan
         kmemleak_scan_thread
         kthread
         ret_from_fork
      
      Since kmemleak_free_part() is tolerant to unknown objects (not tracked
      by kmemleak), it is fine to call it from free_init_pages() even if not
      all address ranges passed to this function are known to kmemleak.
      
       [ bp: Massage. ]
      
      Fixes: b3f0907c ("x86/mm: Add .bss..decrypted section to hold shared variables")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190423165811.36699-1-cai@lca.pw
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c48b027f
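
      A minimal sketch of the fix idea described above, with the rest of free_init_pages()
      elided; treat it as an illustration rather than the verbatim patch:

        /* arch/x86/mm/init.c (sketch) */
        #include <linux/kmemleak.h>

        void free_init_pages(const char *what, unsigned long begin, unsigned long end)
        {
                /*
                 * The range may sit in .data/.bss, which kmemleak registered at
                 * boot. Drop it from kmemleak's view before the pages are freed
                 * and (with DEBUG_PAGEALLOC=y) unmapped; kmemleak_free_part()
                 * silently ignores addresses it never tracked.
                 */
                kmemleak_free_part((void *)begin, end - begin);

                /* ... existing code that frees and unmaps [begin, end) ... */
        }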
    • x86/mm/KASLR: Fix the size of the direct mapping section · 052c78f5
      Committed by Baoquan He
      commit ec3937107ab43f3e8b2bc9dad95710043c462ff7 upstream.
      
      kernel_randomize_memory() uses __PHYSICAL_MASK_SHIFT to calculate
      the maximum amount of system RAM supported. The size of the direct
      mapping section is obtained from the smaller one of the below two
      values:
      
        (actual system RAM size + padding size) vs (max system RAM size supported)
      
      This calculation is wrong since commit
      
        b83ce5ee ("x86/mm/64: Make __PHYSICAL_MASK_SHIFT always 52").
      
      In it, __PHYSICAL_MASK_SHIFT was changed to be 52, regardless of whether
      the kernel is using 4-level or 5-level page tables. Thus, it will always
      use 4 PB as the maximum amount of system RAM, even in 4-level paging
      mode where it should actually be 64 TB.
      
      Thus, the size of the direct mapping section will always
      be the sum of the actual system RAM size plus the padding size.
      
      Even when the amount of system RAM is 64 TB, the following layout will
      still be used. Obviously, KASLR will be weakened significantly.
      
         |____|_______actual RAM_______|_padding_|______the rest_______|
         0            64TB                                            ~120TB
      
      Instead, it should be like this:
      
         |____|_______actual RAM_______|_________the rest______________|
         0            64TB                                            ~120TB
      
      The size of padding region is controlled by
      CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING, which is 10 TB by default.
      
      The above issue only exists when
      CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING is set to a non-zero value,
      which is the case when CONFIG_MEMORY_HOTPLUG is enabled. Otherwise,
      using __PHYSICAL_MASK_SHIFT doesn't affect KASLR.
      
      Fix it by replacing __PHYSICAL_MASK_SHIFT with MAX_PHYSMEM_BITS.
      
       [ bp: Massage commit message. ]
      
      Fixes: b83ce5ee ("x86/mm/64: Make __PHYSICAL_MASK_SHIFT always 52")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Thomas Garnier <thgarnie@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: frank.ramsay@hpe.com
      Cc: herbert@gondor.apana.org.au
      Cc: kirill@shutemov.name
      Cc: mike.travis@hpe.com
      Cc: thgarnie@google.com
      Cc: x86-ml <x86@kernel.org>
      Cc: yamada.masahiro@socionext.com
      Link: https://lkml.kernel.org/r/20190417083536.GE7065@MiWiFi-R3L-srv
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      052c78f5
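
      A simplified sketch of the corrected calculation; the helper below condenses the
      logic of kernel_randomize_memory() into a standalone function and is not the
      actual patch:

        #define TB_SHIFT 40

        /*
         * Cap the direct-mapping region at what the paging mode can address:
         * MAX_PHYSMEM_BITS is 46 (64 TB) with 4-level and 52 (4 PB) with
         * 5-level paging, whereas __PHYSICAL_MASK_SHIFT is now always 52.
         */
        static unsigned long direct_map_size_tb(unsigned long actual_ram_tb,
                                                unsigned long padding_tb,
                                                unsigned int max_physmem_bits)
        {
                unsigned long max_tb = 1UL << (max_physmem_bits - TB_SHIFT);
                unsigned long wanted = actual_ram_tb + padding_tb;

                return wanted < max_tb ? wanted : max_tb;
        }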
    • x86/mce: Improve error message when kernel cannot recover, p2 · 61ff4406
      Committed by Tony Luck
      commit 41f035a86b5b72a4f947c38e94239d20d595352a upstream.
      
      In
      
        c7d606f5 ("x86/mce: Improve error message when kernel cannot recover")
      
      a case was added for a machine check caused by a DATA access to poison
      memory from the kernel. A case should have been added also for an
      uncorrectable error during an instruction fetch in the kernel.
      
      Add that extra case so the error message now reads:
      
        mce: [Hardware Error]: Machine check: Instruction fetch error in kernel
      
      Fixes: c7d606f5 ("x86/mce: Improve error message when kernel cannot recover")
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190225205940.15226-1-tony.luck@intel.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      61ff4406
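
      A highly simplified illustration of the added case; the real change is an entry in
      the MCE severity table, and the function below (including the wording of the
      pre-existing data-load message) is an assumption for illustration only:

        #include <stdbool.h>
        #include <stddef.h>

        /* Sketch: pick the panic message for an unrecoverable in-kernel #MC. */
        static const char *kernel_mce_message(bool data_poison_consumed,
                                              bool insn_fetch_uc_error)
        {
                if (data_poison_consumed)
                        return "Data load in unrecoverable area of kernel";
                /* Newly covered case: uncorrectable error on instruction fetch. */
                if (insn_fetch_uc_error)
                        return "Instruction fetch error in kernel";
                return NULL;
        }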
    • perf/x86/amd: Update generic hardware cache events for Family 17h · 3f8497cf
      Committed by Kim Phillips
      commit 0e3b74e26280f2cf8753717a950b97d424da6046 upstream.
      
      Add a new amd_hw_cache_event_ids_f17h assignment structure set
      for AMD families 17h and above, since a lot has changed.  Specifically:
      
      L1 Data Cache
      
      The data cache access counter remains the same on Family 17h.
      
      For DC misses, PMCx041's definition changes with Family 17h,
      so instead we use the L2 cache accesses from L1 data cache
      misses counter (PMCx060,umask=0xc8).
      
      For DC hardware prefetch events, Family 17h breaks compatibility
      for PMCx067 "Data Prefetcher", so instead, we use PMCx05a "Hardware
      Prefetch DC Fills."
      
      L1 Instruction Cache
      
      PMCs 0x80 and 0x81 (32-byte IC fetches and misses) are backward
      compatible on Family 17h.
      
      For prefetches, we remove the erroneous PMCx04B assignment which
      counts how many software data cache prefetch load instructions were
      dispatched.
      
      LL - Last Level Cache
      
      Removing PMCs 7D, 7E, and 7F assignments, as they do not exist
      on Family 17h, where the last level cache is L3.  L3 counters
      can be accessed using the existing AMD Uncore driver.
      
      Data TLB
      
      On Intel machines, data TLB accesses ("dTLB-loads") are assigned
      to counters that count load/store instructions retired.  This
      is inconsistent with instruction TLB accesses, where Intel
      implementations report iTLB misses that hit in the STLB.
      
      Ideally, dTLB-loads would count higher level dTLB misses that hit
      in lower level TLBs, and dTLB-load-misses would report those
      that also missed in those lower-level TLBs, therefore causing
      a page table walk.  That would be consistent with instruction
      TLB operation, remove the redundancy between dTLB-loads and
      L1-dcache-loads, and prevent perf from producing artificially
      low percentage ratios, i.e. the "0.01%" below:
      
              42,550,869      L1-dcache-loads
              41,591,860      dTLB-loads
                   4,802      dTLB-load-misses          #    0.01% of all dTLB cache hits
               7,283,682      L1-dcache-stores
               7,912,392      dTLB-stores
                     310      dTLB-store-misses
      
      On AMD Families prior to 17h, the "Data Cache Accesses" counter is
      used, which is slightly better than load/store instructions retired,
      but still counts in terms of individual load/store operations
      instead of TLB operations.
      
      So, for AMD Families 17h and higher, this patch assigns "dTLB-loads"
      to a counter for L1 dTLB misses that hit in the L2 dTLB, and
      "dTLB-load-misses" to a counter for L1 DTLB misses that caused
      L2 DTLB misses and therefore also caused page table walks.  This
      results in a much more accurate view of data TLB performance:
      
              60,961,781      L1-dcache-loads
                   4,601      dTLB-loads
                     963      dTLB-load-misses          #   20.93% of all dTLB cache hits
      
      Note that for all AMD families, data loads and stores are combined
      in a single accesses counter, so no 'L1-dcache-stores' are reported
      separately, and stores are counted with loads in 'L1-dcache-loads'.
      
      Also note that the "% of all dTLB cache hits" string is misleading
      because (a) "dTLB cache": although TLBs can be considered caches for
      page tables, in this context, it can be misinterpreted as data cache
      hits because the figures are similar (at least on Intel), and (b) not
      all those loads (technically accesses) technically "hit" at that
      hardware level.  "% of all dTLB accesses" would be more clear/accurate.
      
      Instruction TLB
      
      On Intel machines, 'iTLB-loads' measure iTLB misses that hit in the
      STLB, and 'iTLB-load-misses' measure iTLB misses that also missed in
      the STLB and completed a page table walk.
      
      For AMD Family 17h and above, for 'iTLB-loads' we replace the
      erroneous instruction cache fetches counter with PMCx084
      "L1 ITLB Miss, L2 ITLB Hit".
      
      For 'iTLB-load-misses' we still use PMCx085 "L1 ITLB Miss,
      L2 ITLB Miss", but set a 0xff umask because without it the event
      does not get counted.
      
      Branch Predictor (BPU)
      
      PMCs 0xc2 and 0xc3 continue to be valid across all AMD Families.
      
      Node Level Events
      
      Family 17h does not have a PMCx0e9 counter, and corresponding counters
      have not been made available publicly, so for now, we mark them as
      unsupported for Families 17h and above.
      
      Reference:
      
        "Open-Source Register Reference For AMD Family 17h Processors Models 00h-2Fh"
        Released 7/17/2018, Publication #56255, Revision 3.03:
        https://www.amd.com/system/files/TechDocs/56255_OSRR.pdf
      
      [ mingo: tidied up the line breaks. ]
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Cc: <stable@vger.kernel.org> # v4.9+
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Liška <mliska@suse.cz>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-perf-users@vger.kernel.org
      Fixes: e40ed154 ("perf/x86: Add perf support for AMD family-17h processors")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3f8497cf
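
      A partial sketch of the new Family 17h table built only from the event numbers
      quoted above; the umask-in-bits-15:8 encoding, the table macros and the entries
      omitted here (dTLB, LL, node) are assumptions, not the verbatim upstream table:

        static const u64 amd_hw_cache_event_ids_f17h_sketch
                                [PERF_COUNT_HW_CACHE_MAX]
                                [PERF_COUNT_HW_CACHE_OP_MAX]
                                [PERF_COUNT_HW_CACHE_RESULT_MAX] = {
                [C(L1D)] = {
                        [C(OP_READ)] = {
                                [C(RESULT_MISS)]   = 0xc860, /* PMCx060 umask 0xc8: L2 access from DC miss */
                        },
                        [C(OP_PREFETCH)] = {
                                [C(RESULT_ACCESS)] = 0x005a, /* PMCx05a: Hardware Prefetch DC Fills */
                        },
                },
                [C(L1I)] = {
                        [C(OP_READ)] = {
                                [C(RESULT_ACCESS)] = 0x0080, /* 32-byte IC fetches */
                                [C(RESULT_MISS)]   = 0x0081, /* IC misses */
                        },
                },
                [C(ITLB)] = {
                        [C(OP_READ)] = {
                                [C(RESULT_ACCESS)] = 0x0084, /* L1 ITLB miss, L2 ITLB hit */
                                [C(RESULT_MISS)]   = 0xff85, /* PMCx085 umask 0xff: L1 and L2 ITLB miss */
                        },
                },
                [C(BPU)] = {
                        [C(OP_READ)] = {
                                [C(RESULT_ACCESS)] = 0x00c2, /* retired branch instructions */
                                [C(RESULT_MISS)]   = 0x00c3, /* retired mispredicted branches */
                        },
                },
        };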
    • KVM: SVM: prevent DBG_DECRYPT and DBG_ENCRYPT overflow · 82e8da1f
      Committed by David Rientjes
      [ Upstream commit b86bc2858b389255cd44555ce4b1e427b2b770c0 ]
      
      This ensures that the address and length provided to DBG_DECRYPT and
      DBG_ENCRYPT do not cause an overflow.
      
      At the same time, pass the actual number of pages pinned in memory to
      sev_unpin_memory() as a cleanup.
      Reported-by: Cfir Cohen <cfir@google.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      82e8da1f
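
      A sketch of the kind of guard the commit describes; the parameter and function
      names are invented for illustration, and the real check lives in the SEV
      DBG_DECRYPT/DBG_ENCRYPT ioctl path:

        #include <errno.h>

        /* Reject a zero length or any source/destination range that wraps. */
        static int sev_dbg_range_ok(unsigned long src_uaddr,
                                    unsigned long dst_uaddr,
                                    unsigned long len)
        {
                if (!len ||
                    src_uaddr + len < src_uaddr ||
                    dst_uaddr + len < dst_uaddr)
                        return -EINVAL;
                return 0;
        }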
  4. 05 May 2019, 2 commits
  5. 04 May 2019, 2 commits
    • x86/mm: Don't exceed the valid physical address space · 6a364b2e
      Committed by Ralph Campbell
      [ Upstream commit 92c77f7c4d5dfaaf45b2ce19360e69977c264766 ]
      
      valid_phys_addr_range() is used to sanity check the physical address range
      of an operation, e.g., access to /dev/mem. It uses __pa(high_memory)
      internally.
      
      If memory is populated at the end of the physical address space, then
      __pa(high_memory) is outside of the physical address space because:
      
         high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;
      
      For the comparison in valid_phys_addr_range() this is not an issue, but if
      CONFIG_DEBUG_VIRTUAL is enabled, __pa() maps to __phys_addr(), which
      verifies that the resulting physical address is within the valid physical
      address space of the CPU. So in the case that memory is populated at the
      end of the physical address space, this is not true and triggers a
      VIRTUAL_BUG_ON().
      
      Use __pa(high_memory - 1) to prevent the conversion from going beyond
      the end of valid physical addresses.
      
      Fixes: be62a320 ("x86/mm: Limit mmap() of /dev/mem to valid physical addresses")
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Craig Bergstrom <craigb@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hans Verkuil <hans.verkuil@cisco.com>
      Cc: Mauro Carvalho Chehab <mchehab@s-opensource.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sander Eikelenboom <linux@eikelenboom.it>
      Cc: Sean Young <sean@mess.org>
      
      Link: https://lkml.kernel.org/r/20190326001817.15413-2-rcampbell@nvidia.com
      Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
      6a364b2e
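
      The change boils down to comparing against the last valid byte instead of the
      first invalid one; a simplified sketch of the x86 helper, not necessarily the
      exact diff:

        int valid_phys_addr_range(phys_addr_t addr, size_t count)
        {
                /*
                 * __pa(high_memory) can point one byte past the CPU's valid
                 * physical address space and trip VIRTUAL_BUG_ON() under
                 * CONFIG_DEBUG_VIRTUAL, so translate high_memory - 1 instead.
                 */
                return addr + count - 1 <= __pa(high_memory - 1);
        }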
    • x86/realmode: Don't leak the trampoline kernel address · fe71e625
      Committed by Matteo Croce
      [ Upstream commit b929a500d68479163c48739d809cbf4c1335db6f ]
      
      Since commit
      
        ad67b74d ("printk: hash addresses printed with %p")
      
      at boot "____ptrval____" is printed instead of the trampoline addresses:
      
        Base memory trampoline at [(____ptrval____)] 99000 size 24576
      
      Remove the print as we don't want to leak kernel addresses and this
      statement is not needed anymore.
      
      Fixes: ad67b74d ("printk: hash addresses printed with %p")
      Signed-off-by: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190326203046.20787-1-mcroce@redhat.com
      Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
      fe71e625
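
      For context, a hypothetical example (not the removed line itself) of the %p
      behaviour that made the message useless:

        static void __init report_trampoline(void *base)
        {
                /* Since ad67b74d, %p prints a hashed token, so this line can
                 * only ever show "(____ptrval____)", hence removing the print. */
                pr_debug("Base memory trampoline at [%p]\n", base);
        }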
  6. 02 May 2019, 4 commits
    • x86/fpu: Don't export __kernel_fpu_{begin,end}() · d4ff57d0
      Committed by Sebastian Andrzej Siewior
      commit 12209993e98c5fa1855c467f22a24e3d5b8be205 upstream.
      
      There is one user of __kernel_fpu_begin() and before invoking it,
      it invokes preempt_disable(). So it could invoke kernel_fpu_begin()
      right away. The 32bit version of arch_efi_call_virt_setup() and
      arch_efi_call_virt_teardown() does this already.
      
      The comment above *kernel_fpu*() claims that before invoking
      __kernel_fpu_begin() preemption should be disabled and that KVM is a
      good example of doing it. Well, KVM doesn't do that since commit
      
        f775b13e ("x86,kvm: move qemu/guest FPU switching out to vcpu_run")
      
      so it is not an example anymore.
      
      With EFI gone as the last user of __kernel_fpu_{begin|end}(), both can
      be made static and not exported anymore.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Nicolai Stange <nstange@suse.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kvm ML <kvm@vger.kernel.org>
      Cc: linux-efi <linux-efi@vger.kernel.org>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20181129150210.2k4mawt37ow6c2vq@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d4ff57d0
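
      A sketch of why the export can go away: kernel_fpu_begin() already bundles the
      preemption handling, so outside callers never need the double-underscore variant.
      This is a simplified pseudo-implementation, not the exact kernel code:

        static void __kernel_fpu_begin(void)    /* now static, no export */
        {
                /* ... save the current FPU state, allow in-kernel FPU use ... */
        }

        void kernel_fpu_begin(void)             /* remains the public API */
        {
                preempt_disable();              /* what callers used to do by hand */
                __kernel_fpu_begin();
        }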
    • x86/retpolines: Disable switch jump tables when retpolines are enabled · e923c6b7
      Committed by Daniel Borkmann
      commit a9d57ef15cbe327fe54416dd194ee0ea66ae53a4 upstream.
      
      Commit ce02ef06fcf7 ("x86, retpolines: Raise limit for generating indirect
      calls from switch-case") raised the limit under retpolines to 20 switch
      cases where gcc would only then start to emit jump tables, and therefore
      effectively disabling the emission of slow indirect calls in this area.
      
      After this has been brought to attention to gcc folks [0], Martin Liska
      has then fixed gcc to align with clang by avoiding to generate switch jump
      tables entirely under retpolines. This is taking effect in gcc starting
      from stable version 8.4.0. Given kernel supports compilation with older
      versions of gcc where the fix is not being available or backported anymore,
      we need to keep the extra KBUILD_CFLAGS around for some time and generally
      set the -fno-jump-tables to align with what more recent gcc is doing
      automatically today.
      
      More than 20 switch cases are not expected to be fast-path critical, but
      it would still be good to align with gcc behavior for versions < 8.4.0 in
      order to have consistency across supported gcc versions. vmlinux size is
      slightly growing by 0.27% for older gcc. This flag is only set to work
      around affected gcc, no change for clang.
      
        [0] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952
      Suggested-by: Martin Liska <mliska@suse.cz>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: H.J. Lu <hjl.tools@gmail.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: David S. Miller <davem@davemloft.net>
      Link: https://lkml.kernel.org/r/20190325135620.14882-1-daniel@iogearbox.net
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e923c6b7
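
      A hypothetical stand-alone test one could compile to see the difference the flag
      makes; the function names are made up, and the exact code generation depends on
      the gcc version:

        /* Try:  gcc -O2 -mindirect-branch=thunk-extern -mindirect-branch-register -S switch_test.c
         * vs.:  gcc -O2 -mindirect-branch=thunk-extern -mindirect-branch-register -fno-jump-tables -S switch_test.c
         * Older gcc may emit a jump table (an indirect jump, hence a retpoline)
         * for the first command; the second forces a compare/branch sequence. */
        extern void op0(void), op1(void), op2(void), op3(void), op4(void), op5(void);

        void switch_test(int op)
        {
                switch (op) {
                case 0: op0(); break;
                case 1: op1(); break;
                case 2: op2(); break;
                case 3: op3(); break;
                case 4: op4(); break;
                case 5: op5(); break;
                }
        }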
    • x86, retpolines: Raise limit for generating indirect calls from switch-case · 6cfcff3c
      Committed by Daniel Borkmann
      commit ce02ef06fcf7a399a6276adb83f37373d10cbbe1 upstream.
      
      From networking side, there are numerous attempts to get rid of indirect
      calls in fast-path wherever feasible in order to avoid the cost of
      retpolines, for example, just to name a few:
      
        * 283c16a2dfd3 ("indirect call wrappers: helpers to speed-up indirect calls of builtin")
        * aaa5d90b395a ("net: use indirect call wrappers at GRO network layer")
        * 028e0a476684 ("net: use indirect call wrappers at GRO transport layer")
        * 356da6d0cde3 ("dma-mapping: bypass indirect calls for dma-direct")
        * 09772d92 ("bpf: avoid retpoline for lookup/update/delete calls on maps")
        * 10870dd89e95 ("netfilter: nf_tables: add direct calls for all builtin expressions")
        [...]
      
      Recent work on XDP from Björn and Magnus additionally found that manually
      transforming the XDP return code switch statement with more than 5 cases
      into if-else combination would result in a considerable speedup in XDP
      layer due to avoidance of indirect calls in CONFIG_RETPOLINE enabled
      builds. On i40e driver with XDP prog attached, a 20-26% speedup has been
      observed [0]. Aside from XDP, there are many other places later in the
      networking stack's critical path with similar switch-case
      processing. Rather than fixing every XDP-enabled driver and locations in
      stack by hand, it would be good to instead raise the limit where gcc would
      emit expensive indirect calls from the switch under retpolines and stick
      with the default as-is in case of !retpoline configured kernels. This would
      also have the advantage that for archs where this is not necessary, we let
      compiler select the underlying target optimization for these constructs and
      avoid potential slow-downs by if-else hand-rewrite.
      
      In case of gcc, this setting is controlled by case-values-threshold which
      has an architecture global default that selects 4 or 5 (latter if target
      does not have a case insn that compares the bounds) where some arch back
      ends like arm64 or s390 override it with their own target hooks, for
      example, in gcc commit db7a90aa0de5 ("S/390: Disable prediction of indirect
      branches") the threshold pretty much disables jump tables by limit of 20
      under retpoline builds.  Comparing gcc's and clang's default code
      generation on x86-64 under O2 level with retpoline build results in the
      following outcome for 5 switch cases:
      
      * gcc with -mindirect-branch=thunk-inline -mindirect-branch-register:
      
        # gdb -batch -ex 'disassemble dispatch' ./c-switch
        Dump of assembler code for function dispatch:
         0x0000000000400be0 <+0>:     cmp    $0x4,%edi
         0x0000000000400be3 <+3>:     ja     0x400c35 <dispatch+85>
         0x0000000000400be5 <+5>:     lea    0x915f8(%rip),%rdx        # 0x4921e4
         0x0000000000400bec <+12>:    mov    %edi,%edi
         0x0000000000400bee <+14>:    movslq (%rdx,%rdi,4),%rax
         0x0000000000400bf2 <+18>:    add    %rdx,%rax
         0x0000000000400bf5 <+21>:    callq  0x400c01 <dispatch+33>
         0x0000000000400bfa <+26>:    pause
         0x0000000000400bfc <+28>:    lfence
         0x0000000000400bff <+31>:    jmp    0x400bfa <dispatch+26>
         0x0000000000400c01 <+33>:    mov    %rax,(%rsp)
         0x0000000000400c05 <+37>:    retq
         0x0000000000400c06 <+38>:    nopw   %cs:0x0(%rax,%rax,1)
         0x0000000000400c10 <+48>:    jmpq   0x400c90 <fn_3>
         0x0000000000400c15 <+53>:    nopl   (%rax)
         0x0000000000400c18 <+56>:    jmpq   0x400c70 <fn_2>
         0x0000000000400c1d <+61>:    nopl   (%rax)
         0x0000000000400c20 <+64>:    jmpq   0x400c50 <fn_1>
         0x0000000000400c25 <+69>:    nopl   (%rax)
         0x0000000000400c28 <+72>:    jmpq   0x400c40 <fn_0>
         0x0000000000400c2d <+77>:    nopl   (%rax)
         0x0000000000400c30 <+80>:    jmpq   0x400cb0 <fn_4>
         0x0000000000400c35 <+85>:    push   %rax
         0x0000000000400c36 <+86>:    callq  0x40dd80 <abort>
        End of assembler dump.
      
      * clang with -mretpoline emitting search tree:
      
        # gdb -batch -ex 'disassemble dispatch' ./c-switch
        Dump of assembler code for function dispatch:
         0x0000000000400b30 <+0>:     cmp    $0x1,%edi
         0x0000000000400b33 <+3>:     jle    0x400b44 <dispatch+20>
         0x0000000000400b35 <+5>:     cmp    $0x2,%edi
         0x0000000000400b38 <+8>:     je     0x400b4d <dispatch+29>
         0x0000000000400b3a <+10>:    cmp    $0x3,%edi
         0x0000000000400b3d <+13>:    jne    0x400b52 <dispatch+34>
         0x0000000000400b3f <+15>:    jmpq   0x400c50 <fn_3>
         0x0000000000400b44 <+20>:    test   %edi,%edi
         0x0000000000400b46 <+22>:    jne    0x400b5c <dispatch+44>
         0x0000000000400b48 <+24>:    jmpq   0x400c20 <fn_0>
         0x0000000000400b4d <+29>:    jmpq   0x400c40 <fn_2>
         0x0000000000400b52 <+34>:    cmp    $0x4,%edi
         0x0000000000400b55 <+37>:    jne    0x400b66 <dispatch+54>
         0x0000000000400b57 <+39>:    jmpq   0x400c60 <fn_4>
         0x0000000000400b5c <+44>:    cmp    $0x1,%edi
         0x0000000000400b5f <+47>:    jne    0x400b66 <dispatch+54>
         0x0000000000400b61 <+49>:    jmpq   0x400c30 <fn_1>
         0x0000000000400b66 <+54>:    push   %rax
         0x0000000000400b67 <+55>:    callq  0x40dd20 <abort>
        End of assembler dump.
      
        For sake of comparison, clang without -mretpoline:
      
        # gdb -batch -ex 'disassemble dispatch' ./c-switch
        Dump of assembler code for function dispatch:
         0x0000000000400b30 <+0>:	cmp    $0x4,%edi
         0x0000000000400b33 <+3>:	ja     0x400b57 <dispatch+39>
         0x0000000000400b35 <+5>:	mov    %edi,%eax
         0x0000000000400b37 <+7>:	jmpq   *0x492148(,%rax,8)
         0x0000000000400b3e <+14>:	jmpq   0x400bf0 <fn_0>
         0x0000000000400b43 <+19>:	jmpq   0x400c30 <fn_4>
         0x0000000000400b48 <+24>:	jmpq   0x400c10 <fn_2>
         0x0000000000400b4d <+29>:	jmpq   0x400c20 <fn_3>
         0x0000000000400b52 <+34>:	jmpq   0x400c00 <fn_1>
         0x0000000000400b57 <+39>:	push   %rax
         0x0000000000400b58 <+40>:	callq  0x40dcf0 <abort>
        End of assembler dump.
      
      Raising the cases to a high number (e.g. 100) will still result in similar
      code generation pattern with clang and gcc as above, in other words clang
      generally turns off jump table emission by having an extra expansion pass
      under retpoline build to turn indirectbr instructions from their IR into
      switch instructions as a built-in -mno-jump-table lowering of a switch (in
      this case, even if IR input already contained an indirect branch).
      
      For gcc, adding --param=case-values-threshold=20 as in similar fashion as
      s390 in order to raise the limit for x86 retpoline enabled builds results
      in a small vmlinux size increase of only 0.13% (before=18,027,528
      after=18,051,192). For clang this option is ignored due to i) not being
      needed as mentioned and ii) not having above cmdline
      parameter. Non-retpoline-enabled builds with gcc continue to use the
      default case-values-threshold setting, so nothing changes here.
      
      [0] https://lore.kernel.org/netdev/20190129095754.9390-1-bjorn.topel@gmail.com/
          and "The Path to DPDK Speeds for AF_XDP", LPC 2018, networking track:
        - http://vger.kernel.org/lpc_net2018_talks/lpc18_pres_af_xdp_perf-v3.pdf
        - http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Björn Töpel <bjorn.topel@intel.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: netdev@vger.kernel.org
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Link: https://lkml.kernel.org/r/20190221221941.29358-1-daniel@iogearbox.net
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6cfcff3c
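
      For reference, a plausible reconstruction of the c-switch dispatch() routine whose
      disassembly is shown above, inferred from the symbol names; it is not taken from
      the original test source:

        #include <stdlib.h>

        /* Five distinct targets so the compiler considers a jump table. */
        __attribute__((noinline)) void fn_0(void) {}
        __attribute__((noinline)) void fn_1(void) {}
        __attribute__((noinline)) void fn_2(void) {}
        __attribute__((noinline)) void fn_3(void) {}
        __attribute__((noinline)) void fn_4(void) {}

        void dispatch(int op)
        {
                switch (op) {
                case 0: fn_0(); break;
                case 1: fn_1(); break;
                case 2: fn_2(); break;
                case 3: fn_3(); break;
                case 4: fn_4(); break;
                default: abort();
                }
        }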
    • perf/x86/intel: Update KBL Package C-state events to also include PC8/PC9/PC10 counters · 37ecf31a
      Committed by Harry Pan
      commit 82c99f7a81f28f8c1be5f701c8377d14c4075b10 upstream.
      
      Kaby Lake (and Coffee Lake) has PC8/PC9/PC10 residency counters.
      
      This patch updates the list of Kaby/Coffee Lake PMU event counters
      from the snb_cstates[] list of events to the hswult_cstates[]
      list of events, which keeps all previously supported events and
      also adds the PKG_C8, PKG_C9 and PKG_C10 residency counters.
      
      This allows user space tools to profile them through the perf interface.
      Signed-off-by: Harry Pan <harry.pan@intel.com>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: gs0622@gmail.com
      Link: http://lkml.kernel.org/r/20190424145033.1924-1-harry.pan@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      37ecf31a
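
      The change amounts to pointing the Kaby/Coffee Lake model entries at the
      Haswell-ULT event list, which already carries the PKG_C8/C9/C10 residency events.
      A sketch of the model-table fragment; the macro and constant names follow the
      upstream cstate driver but are quoted from memory:

        /* arch/x86/events/intel/cstate.c (sketch): these models previously
         * mapped to snb_cstates, which lacks PC8/PC9/PC10. */
        X86_CSTATES_MODEL(INTEL_FAM6_KABYLAKE_MOBILE,  hswult_cstates),
        X86_CSTATES_MODEL(INTEL_FAM6_KABYLAKE_DESKTOP, hswult_cstates),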
  7. 27 April 2019, 5 commits