1. 07 Nov 2018 (1 commit)
    • USB: Wait for extra delay time after USB_PORT_FEAT_RESET for quirky hub · 781f0766
      Committed by Kai-Heng Feng
      Devices connected under Terminus Technology Inc. Hub (1a40:0101) may
      fail to work after the system resumes from suspend:
      [  206.063325] usb 3-2.4: reset full-speed USB device number 4 using xhci_hcd
      [  206.143691] usb 3-2.4: device descriptor read/64, error -32
      [  206.351671] usb 3-2.4: device descriptor read/64, error -32
      
      Info for this hub:
      T:  Bus=03 Lev=01 Prnt=01 Port=01 Cnt=01 Dev#=  2 Spd=480 MxCh= 4
      D:  Ver= 2.00 Cls=09(hub  ) Sub=00 Prot=01 MxPS=64 #Cfgs=  1
      P:  Vendor=1a40 ProdID=0101 Rev=01.11
      S:  Product=USB 2.0 Hub
      C:  #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=100mA
      I:  If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
      
      Some experiments indicate that the USB devices connected to the hub are
      innocent; it's the hub itself that is to blame. The hub needs extra delay
      time after it resets its port.
      
      Hence, wait for an extra delay if the device is connected to this quirky
      hub.
      Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
      Cc: stable <stable@vger.kernel.org>
      Acked-by: Alan Stern <stern@rowland.harvard.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
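      As a rough illustration (not the actual patch; the helper names and the
      100 ms value are assumptions, only the 1a40:0101 IDs come from the log
      above), the idea amounts to gating an extra settle delay on the hub's
      vendor/product ID:

        #include <linux/delay.h>
        #include <linux/usb.h>

        /* Terminus Technology Inc. Hub (1a40:0101) from the listing above. */
        static bool hub_needs_extra_reset_delay(struct usb_device *hdev)
        {
                return le16_to_cpu(hdev->descriptor.idVendor) == 0x1a40 &&
                       le16_to_cpu(hdev->descriptor.idProduct) == 0x0101;
        }

        static void hub_port_reset_settle(struct usb_device *hdev)
        {
                if (hub_needs_extra_reset_delay(hdev))
                        msleep(100);    /* extra delay; exact value assumed */
        }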
  2. 27 Oct 2018 (1 commit)
    • mm: provide kernel parameter to allow disabling page init poisoning · f682a97a
      Committed by Alexander Duyck
      Patch series "Address issues slowing persistent memory initialization", v5.
      
      The main thing this patch set achieves is that it allows us to initialize
      each node's worth of persistent memory independently.  As a result we reduce
      page init time by about 2 minutes because instead of taking 30 to 40
      seconds per node and going through each node one at a time, we process all
      4 nodes in parallel in the case of a 12TB persistent memory setup spread
      evenly over 4 nodes.
      
      This patch (of 3):
      
      On systems with a large amount of memory it can take a significant amount
      of time to initialize all of the page structs with the PAGE_POISON_PATTERN
      value.  I have seen it take over 2 minutes to initialize a system with
      over 12TB of RAM.
      
      In order to work around the issue I had to disable CONFIG_DEBUG_VM and
      then the boot time returned to something much more reasonable as the
      arch_add_memory call completed in milliseconds versus seconds.  However in
      doing that I had to disable all of the other VM debugging on the system.
      
      In order to work around a kernel that might have CONFIG_DEBUG_VM enabled
      on a system that has a large amount of memory I have added a new kernel
      parameter named "vm_debug" that can be set to "-" in order to disable it.
      
      Link: http://lkml.kernel.org/r/20180925201921.3576.84239.stgit@localhost.localdomain
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
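      A hedged sketch of such a boot-time switch (the parameter name and the "-"
      syntax come from the changelog; the variable and the exact parsing are
      assumptions):

        #include <linux/cache.h>
        #include <linux/init.h>

        static bool want_page_init_poison __ro_after_init = true;

        /* "vm_debug=-" turns off the expensive page-struct poisoning at init. */
        static int __init setup_vm_debug(char *str)
        {
                if (str && *str == '-')
                        want_page_init_poison = false;
                return 1;
        }
        __setup("vm_debug", setup_vm_debug);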
  3. 11 Oct 2018 (1 commit)
  4. 09 Oct 2018 (1 commit)
  5. 03 Oct 2018 (3 commits)
  6. 02 Oct 2018 (1 commit)
    • perf/x86/intel: Add a separate Arch Perfmon v4 PMI handler · af3bdb99
      Committed by Andi Kleen
      Implements counter freezing for Arch Perfmon v4 (Skylake and
      newer). This speeds up the PMI handler by avoiding unnecessary
      MSR writes and makes it more accurate.
      
      The Arch Perfmon v4 PMI handler is substantially different from
      the older PMI handler.
      
      Differences to the old handler:
      
      - It relies on counter freezing, which eliminates several MSR
        writes from the PMI handler and lowers the overhead significantly.
      
        It makes the PMI handler more accurate, as all counters get
        frozen atomically as soon as any counter overflows, so much less
        of the PMI handler itself gets counted.
      
        With the freezing we don't need to disable or enable counters or
        PEBS. Only BTS, which does not support auto-freezing, still needs to
        be explicitly managed.
      
      - The PMU acking is done at the end, not the beginning.
        This makes it possible to avoid manual enabling/disabling
        of the PMU, instead we just rely on the freezing/acking.
      
      - The APIC is acked before reenabling the PMU, which avoids
        problems with LBRs occasionally not getting unfrozen on Skylake.
      
      - Looping is only needed to work around a corner case in which several
        PMIs arrive very close to each other. In the common case the counters are
        frozen during the PMI handler, so no re-check is needed.
      
      This patch:
      
      - Adds code to enable v4 counter freezing.
      - Forks the <=v3 and >=v4 PMI handlers into separate functions.
      - Adds a kernel parameter to disable counter freezing. It took some time to
        debug counter freezing, so in case there are new problems we added an
        option to turn it off. We would not expect this to be used until there
        are new bugs.
      - Applies only to big core. The patch for small core will be posted
        separately later.
      
      Performance:
      
      When profiling a kernel build on Kabylake with different perf options,
      measuring the length of all NMI handlers using the nmi handler
      trace point:
      
      V3 is without counter freezing.
      V4 is with counter freezing.
      The value is the average cost of the PMI handler.
      (lower is better)
      
      perf options                V3(ns) V4(ns)  delta
      -c 100000                   1088   894     -18%
      -g -c 100000                1862   1646    -12%
      --call-graph lbr -c 100000  3649   3367    -8%
      --c.g. dwarf -c 100000      2248   1982    -12%
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Link: http://lkml.kernel.org/r/1533712328-2834-2-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
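      A simplified sketch of the handler fork described above (the handler and
      flag names are assumptions; only x86_pmu.version is an existing field):

        /* Pick the freeze-based handler on Arch Perfmon v4 and later. */
        static int intel_pmu_dispatch_pmi(struct pt_regs *regs)
        {
                if (x86_pmu.version >= 4 && counter_freezing)
                        return intel_pmu_handle_irq_v4(regs);  /* counters frozen by hw, ack last */

                return intel_pmu_handle_irq_v3(regs);          /* legacy disable/process/enable */
        }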
  7. 01 Oct 2018 (1 commit)
  8. 21 Sep 2018 (1 commit)
  9. 14 Sep 2018 (1 commit)
    • xen/balloon: add runtime control for scrubbing ballooned out pages · 197ecb38
      Committed by Marek Marczykowski-Górecki
      Scrubbing pages on initial balloon down can take some time, especially
      in the nested virtualization case (nested EPT is slow). When an HVM/PVH
      guest is started with memory= significantly lower than maxmem=, all the
      extra pages will be scrubbed before being returned to Xen. But since most
      of them weren't used at all at that point, Xen needs to populate them
      first (from the populate-on-demand pool). In the nested virt case (Xen
      inside KVM) this slows down the guest boot by 15-30s with just 1.5GB
      needed to be returned to Xen.
      
      Add a runtime parameter to enable/disable it, to allow initially
      disabling scrubbing and then enabling it back during boot (for example
      from the initramfs). Such usage relies on the assumption that a) most
      pages ballooned out during the initial boot weren't used at all, and
      b) even if they were, very few secrets are in the guest at that time
      (before any serious userspace kicks in).
      Convert CONFIG_XEN_SCRUB_PAGES to CONFIG_XEN_SCRUB_PAGES_DEFAULT (also
      enabled by default), controlling the default value for the new runtime
      switch.
      Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
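      A minimal sketch of the runtime switch (the Kconfig symbol is from the
      changelog; the variable name and permissions are assumptions):

        #include <linux/kconfig.h>
        #include <linux/moduleparam.h>

        /* Runtime-writable knob, seeded from the Kconfig default. */
        static bool scrub_pages = IS_ENABLED(CONFIG_XEN_SCRUB_PAGES_DEFAULT);
        module_param(scrub_pages, bool, 0644);

      With mode 0644 the value can be flipped at runtime (for example re-enabled
      from the initramfs) through the corresponding /sys/module/.../parameters
      entry.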
  10. 02 Sep 2018 (1 commit)
  11. 31 Aug 2018 (3 commits)
    • rcu: Compute jiffies_till_sched_qs from other kernel parameters · c06aed0e
      Committed by Paul E. McKenney
      The jiffies_till_sched_qs value is used to determine how old a grace period
      must be before RCU enlists the help of the scheduler to force a quiescent
      state on the holdout CPU.  Currently, this defaults to HZ/10 regardless of
      system size and may be set only at boot time.  This can be a problem for
      very large systems, because if the values of the jiffies_till_first_fqs
      and jiffies_till_next_fqs kernel parameters are left at their defaults,
      they are calculated to increase as the number of CPUs actually configured
      on the system increases.  Thus, on a sufficiently large system, RCU would
      enlist the help of the scheduler before the grace-period kthread had a
      chance to scan for idle CPUs, which wastes CPU time.
      
      This commit therefore allows jiffies_till_sched_qs to be set, if desired,
      but if left at its default, computes it as jiffies_till_first_fqs plus twice
      jiffies_till_next_fqs, thus allowing three force-quiescent-state scans
      for idle CPUs.  This scales with the number of CPUs, providing sensible
      default values.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
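      Expressed as code, the default described above is roughly as follows (the
      "unset" sentinel and the helper are assumptions):

        #include <linux/kernel.h>
        #include <linux/types.h>

        static ulong jiffies_till_first_fqs;
        static ulong jiffies_till_next_fqs;
        static ulong jiffies_till_sched_qs = ULONG_MAX;  /* "not set on the command line" */

        /* Allow three force-quiescent-state scans before enlisting the scheduler. */
        static void adjust_jiffies_till_sched_qs(void)
        {
                if (jiffies_till_sched_qs == ULONG_MAX)
                        jiffies_till_sched_qs = jiffies_till_first_fqs +
                                                2 * jiffies_till_next_fqs;
        }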
    • rcu: Stop testing RCU-bh and RCU-sched · 72ce30dd
      Committed by Paul E. McKenney
      Now that the RCU-bh and RCU-sched update-side functions are simple
      wrappers around their RCU counterparts, there isn't a whole lot of
      point in testing them.  This commit therefore removes the self-test
      capability and removes the corresponding kernel-boot parameters.
      It also updates the various rcutorture .boot files to remove the
      kernel boot parameters that call for testing RCU-bh and RCU-sched.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • doc: Update removal of RCU-bh/sched update machinery · 77095901
      Committed by Paul E. McKenney
      The RCU-bh update API is now defined in terms of that of RCU-preempt and
      RCU-sched, so this commit updates the documentation accordingly.
      
      In addition, although RCU-sched persists in !PREEMPT kernels, in
      the PREEMPT case its update API is now defined in terms of that of
      RCU-preempt, so this commit also updates the documentation accordingly.
      
      While in the area, this commit removes the documentation for the
      now-obsolete synchronize_rcu_mult() and clarifies the Tasks RCU
      documentation.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  12. 23 Aug 2018 (1 commit)
  13. 10 Aug 2018 (3 commits)
    • PCI: Add "pci=disable_acs_redir=" parameter for peer-to-peer support · aaca43fd
      Committed by Logan Gunthorpe
      To support peer-to-peer traffic on a segment of the PCI hierarchy, we must
      disable the ACS redirect bits for select PCI bridges.  The bridges must be
      selected before the devices are discovered by the kernel and the IOMMU
      groups created.  Therefore, add a kernel command line parameter to specify
      devices which must have their ACS bits disabled.
      
      The new parameter takes a list of devices separated by a semicolon.  Each
      device specified will have its ACS redirect bits disabled.  This is
      similar to the existing 'resource_alignment' parameter.
      
      The ACS P2P Request Redirect, P2P Completion Redirect and P2P
      Egress Control bits are disabled, which is sufficient to always allow
      passing P2P traffic uninterrupted.  The bits are set after the kernel
      (optionally) enables the ACS bits itself.  It is also done regardless of
      whether the kernel or platform firmware sets the bits.
      
      If the user tries to disable the ACS redirect for a device without the ACS
      capability, print a warning to dmesg.
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      [bhelgaas: reorder to add the generic code first and move the
      device-specific quirk to subsequent patches]
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: Stephen Bates <sbates@raithlin.com>
      Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
      Acked-by: Christian König <christian.koenig@amd.com>
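      A sketch of the bit clearing described above, using the standard ACS
      capability accessors (the function name is an assumption; the register
      and bit macros are the generic ones from pci_regs.h):

        #include <linux/pci.h>

        static void clear_acs_redir_bits(struct pci_dev *dev)
        {
                int pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS);
                u16 ctrl;

                if (!pos) {
                        pci_warn(dev, "cannot disable ACS redirect, no ACS capability\n");
                        return;
                }

                pci_read_config_word(dev, pos + PCI_ACS_CTRL, &ctrl);
                /* P2P Request Redirect, P2P Completion Redirect, P2P Egress Control */
                ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_EC);
                pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl);
        }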
    • PCI: Allow specifying devices using a base bus and path of devfns · 45db3370
      Committed by Logan Gunthorpe
      When specifying PCI devices on the kernel command line using a
      bus/device/function address, bus numbers can change when adding or
      replacing a device, changing motherboard firmware, or applying kernel
      parameters like "pci=assign-buses".  When bus numbers change, it's likely
      the command line tweak will be applied to the wrong device.
      
      Therefore, it is useful to be able to specify a device with a base bus
      number and the path of devfns needed to get to it, similar to the "device
      scope" structure in the Intel VT-d spec, Section 8.3.1.
      
      Thus, we add an option to specify devices in the following format:
      
        [<domain>:]<bus>:<device>.<func>[/<device>.<func>]*
      
      The path can be any segment within the PCI hierarchy of any length and
      can be determined through the use of 'lspci -t'.  When specified this way,
      a renumbered bus is much less likely to still match a valid device
      specification, so the tweak is less likely to be applied to the wrong device.
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      [bhelgaas: use "device" instead of "slot" in documentation since that's the
      usual language in the PCI specs]
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: Stephen Bates <sbates@raithlin.com>
      Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
      Acked-by: Christian König <christian.koenig@amd.com>
    • PCI: Make specifying PCI devices in kernel parameters reusable · 07d8d7e5
      Committed by Logan Gunthorpe
      Separate out the code to match a PCI device with a string (typically
      originating from a kernel parameter) from the
      pci_specified_resource_alignment() function into its own helper function.
      
      While we are at it, this change fixes the kernel style of the function
      (fixing a number of long lines and extra parentheses).
      
      Additionally, make the analogous change to the kernel parameter
      documentation: Separate the description of how to specify a PCI device
      into its own section at the head of the "pci=" parameter.
      
      This patch should have no functional alterations.
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      [bhelgaas: use "device" instead of "slot" in documentation since that's the
      usual language in the PCI specs]
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: Stephen Bates <sbates@raithlin.com>
      Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
      Acked-by: Christian König <christian.koenig@amd.com>
  14. 07 Aug 2018 (1 commit)
  15. 27 Jul 2018 (1 commit)
    • iommu: Add config option to set passthrough as default · 58d11317
      Committed by Olof Johansson
      This allows the default behavior to be controlled by a kernel config
      option instead of changing the commandline for the kernel to include
      "iommu.passthrough=on" or "iommu=pt" on machines where this is desired.
      
      Likewise, for machines where this config option is enabled, it can be
      disabled at boot time with "iommu.passthrough=off" or "iommu=nopt".
      
      Also corrected iommu=pt documentation for IA-64, since it has no code that
      parses iommu= at all.
      Signed-off-by: Olof Johansson <olof@lixom.net>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
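      A sketch of how a Kconfig-controlled default can seed the type that
      "iommu.passthrough=" otherwise selects (the config symbol and variable
      name are assumptions consistent with the changelog):

        #include <linux/iommu.h>
        #include <linux/kconfig.h>

        /* Default domain type: identity (passthrough) when the option is enabled. */
        static unsigned int iommu_def_domain_type =
                IS_ENABLED(CONFIG_IOMMU_DEFAULT_PASSTHROUGH) ?
                        IOMMU_DOMAIN_IDENTITY : IOMMU_DOMAIN_DMA;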
  16. 20 Jul 2018 (1 commit)
    • x86/tsc: Redefine notsc to behave as tsc=unstable · fe9af81e
      Committed by Pavel Tatashin
      Currently, the notsc kernel parameter disables the use of the TSC by
      sched_clock(). However, this parameter does not prevent the kernel from
      accessing the TSC in other places.
      
      The only rationale for booting with notsc is to avoid timing discrepancies
      on multi-socket systems where the TSCs are not properly synchronized, and
      thus to exclude the TSC from being used for timekeeping. But that prevents
      using the TSC as sched_clock() as well, which is not necessary as the core
      sched_clock() implementation can handle non-synchronized TSC-based sched
      clocks just fine.
      
      However, there is another method to solve the above problem: booting with
      tsc=unstable parameter. This parameter allows sched_clock() to use TSC and
      just excludes it from timekeeping.
      
      So there is no real reason to keep notsc, but for compatibility reasons the
      parameter has to stay. Make it behave like 'tsc=unstable' instead.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: steven.sistare@oracle.com
      Cc: daniel.m.jordan@oracle.com
      Cc: linux@armlinux.org.uk
      Cc: schwidefsky@de.ibm.com
      Cc: heiko.carstens@de.ibm.com
      Cc: john.stultz@linaro.org
      Cc: sboyd@codeaurora.org
      Cc: hpa@zytor.com
      Cc: peterz@infradead.org
      Cc: prarit@redhat.com
      Cc: feng.tang@intel.com
      Cc: pmladek@suse.com
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: linux-s390@vger.kernel.org
      Cc: boris.ostrovsky@oracle.com
      Cc: jgross@suse.com
      Cc: pbonzini@redhat.com
      Link: https://lkml.kernel.org/r/20180719205545.16512-12-pasha.tatashin@oracle.com
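      Conceptually the redefinition boils down to something like this sketch
      (the setup function is an assumption; mark_tsc_unstable() is the existing
      helper that excludes the TSC from timekeeping):

        #include <linux/init.h>
        #include <asm/tsc.h>

        /* "notsc" no longer hides the TSC; it only marks it unstable. */
        static int __init notsc_setup(char *str)
        {
                mark_tsc_unstable("boot parameter notsc");
                return 1;
        }
        __setup("notsc", notsc_setup);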
  17. 18 Jul 2018 (1 commit)
  18. 13 Jul 2018 (2 commits)
    • x86/bugs, kvm: Introduce boot-time control of L1TF mitigations · d90a7a0e
      Committed by Jiri Kosina
      Introduce the 'l1tf=' kernel command line option to allow for boot-time
      switching of the mitigation that is used on processors affected by L1TF.
      
      The possible values are:
      
        full
      	Provides all available mitigations for the L1TF vulnerability. Disables
      	SMT and enables all mitigations in the hypervisors. SMT control via
      	/sys/devices/system/cpu/smt/control is still possible after boot.
      	Hypervisors will issue a warning when the first VM is started in
      	a potentially insecure configuration, i.e. SMT enabled or L1D flush
      	disabled.
      
        full,force
      	Same as 'full', but disables SMT control. Implies the 'nosmt=force'
      	command line option. sysfs control of SMT and the hypervisor flush
      	control is disabled.
      
        flush
      	Leaves SMT enabled and enables the conditional hypervisor mitigation.
      	Hypervisors will issue a warning when the first VM is started in a
      	potentially insecure configuration, i.e. SMT enabled or L1D flush
      	disabled.
      
        flush,nosmt
      	Disables SMT and enables the conditional hypervisor mitigation. SMT
      	control via /sys/devices/system/cpu/smt/control is still possible
      	after boot. If SMT is reenabled or flushing disabled at runtime
      	hypervisors will issue a warning.
      
        flush,nowarn
      	Same as 'flush', but hypervisors will not warn when
      	a VM is started in a potentially insecure configuration.
      
        off
      	Disables hypervisor mitigations and doesn't emit any warnings.
      
      Default is 'flush'.
      
      Let KVM adhere to these semantics, which means:
      
        - 'l1tf=full,force'   : Perform L1D flushes. No runtime control
                                possible.

        - 'l1tf=full'
        - 'l1tf=flush'
        - 'l1tf=flush,nosmt'  : Perform L1D flushes and warn on VM start if
                                SMT has been enabled or L1D flushing has been
                                disabled at runtime.

        - 'l1tf=flush,nowarn' : Perform L1D flushes and no warnings are emitted.

        - 'l1tf=off'          : L1D flushes are not performed and no warnings
                                are emitted.
      
      KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush'
      module parameter except when l1tf=full,force is set.
      
      This makes KVM's private 'nosmt' option redundant, and as it is a bit
      non-systematic anyway (this is something to control globally, not on
      hypervisor level), remove that option.
      
      Add the missing Documentation entry for the l1tf vulnerability sysfs file
      while at it.
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Jiri Kosina <jkosina@suse.cz>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Link: https://lkml.kernel.org/r/20180713142323.202758176@linutronix.de
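      A simplified sketch of the option parsing (the enum and variable names
      mirror the option names above and are assumptions):

        #include <linux/errno.h>
        #include <linux/init.h>
        #include <linux/string.h>

        static enum {
                L1TF_MITIGATION_OFF,
                L1TF_MITIGATION_FLUSH_NOWARN,
                L1TF_MITIGATION_FLUSH,
                L1TF_MITIGATION_FLUSH_NOSMT,
                L1TF_MITIGATION_FULL,
                L1TF_MITIGATION_FULL_FORCE,
        } l1tf_mitigation = L1TF_MITIGATION_FLUSH;      /* default is 'flush' */

        static int __init l1tf_cmdline(char *str)
        {
                if (!str)
                        return -EINVAL;

                if (!strcmp(str, "off"))
                        l1tf_mitigation = L1TF_MITIGATION_OFF;
                else if (!strcmp(str, "flush,nowarn"))
                        l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOWARN;
                else if (!strcmp(str, "flush"))
                        l1tf_mitigation = L1TF_MITIGATION_FLUSH;
                else if (!strcmp(str, "flush,nosmt"))
                        l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOSMT;
                else if (!strcmp(str, "full"))
                        l1tf_mitigation = L1TF_MITIGATION_FULL;
                else if (!strcmp(str, "full,force"))
                        l1tf_mitigation = L1TF_MITIGATION_FULL_FORCE;

                return 0;
        }
        early_param("l1tf", l1tf_cmdline);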
    • rcutorture: Change units of onoff_interval to jiffies · 028be12b
      Committed by Paul E. McKenney
      Some RCU bugs have been sensitive to the frequency of CPU-hotplug
      operations, which have been gradually increased over time.  But this
      frequency is now at the one-second lower limit that can be specified using
      the rcutorture.onoff_interval kernel parameter.  This commit therefore
      changes the units of rcutorture.onoff_interval from seconds to jiffies,
      and also sets the value specified for this kernel parameter in the TREE03
      rcutorture scenario to 200, which is 200 milliseconds for HZ=1000.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  19. 11 Jul 2018 (2 commits)
  20. 10 Jul 2018 (1 commit)
    • driver core: allow stopping deferred probe after init · 25b4e70d
      Committed by Rob Herring
      Deferred probe will currently wait forever for dependent devices to probe,
      but sometimes a driver will never exist. It's also not always critical for
      a driver to exist. Platforms can rely on default configuration from the
      bootloader or reset defaults for things such as pinctrl and power domains.
      This is often the case with initial platform support until various drivers
      get enabled. There are at least two scenarios where deferred probe can render
      a platform broken. Both involve using a DT which has more devices and
      dependencies than the kernel supports. In the first case, a driver may be
      disabled in the kernel config. In the second, the kernel version may
      simply not have the dependent driver. This can happen if using a newer DT
      (provided by firmware perhaps) with a stable kernel version. Deferred
      probe issues can be difficult to debug, especially if the console has
      dependencies or userspace fails to boot to a shell.
      
      There are also cases like IOMMUs where only built-in drivers are
      supported, so deferring probe after initcalls is not needed. The IOMMU
      subsystem implemented its own mechanism to handle this using OF_DECLARE
      linker sections.
      
      This commit makes ending deferred probe conditional on initcalls
      being completed or on a debug timeout. Subsystems or drivers may opt in by
      calling driver_deferred_probe_check_init_done() instead of
      unconditionally returning -EPROBE_DEFER. They may use additional
      information from DT or the kernel's config to decide whether to continue to
      defer probe or not.
      
      The timeout mechanism is intended for debug purposes and WARNs loudly.
      The remaining deferred probe pending list will also be dumped after the
      timeout. Note that this timeout won't work for the console, which needs
      to be enabled before userspace starts. However, if the console's
      dependencies are resolved, then the kernel log will be printed (as
      opposed to no output).
      
      Cc: Alexander Graf <agraf@suse.de>
      Signed-off-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
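      An illustrative sketch of how a driver could opt in (the
      driver_deferred_probe_check_init_done() name comes from the changelog;
      its signature here and the surrounding helpers are assumptions):

        /* Instead of unconditionally returning -EPROBE_DEFER when a supplier
         * is missing, let the core keep deferring only until initcalls are
         * done (or the debug timeout fires). Signature assumed. */
        static int example_attach_power_domain(struct device *dev)
        {
                struct example_pd *pd = example_find_power_domain(dev); /* hypothetical */

                if (!pd)
                        return driver_deferred_probe_check_init_done(dev);

                return example_bind_power_domain(dev, pd);              /* hypothetical */
        }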
  21. 06 Jul 2018 (1 commit)
  22. 05 Jul 2018 (2 commits)
    • x86/KVM/VMX: Add module argument for L1TF mitigation · a399477e
      Committed by Konrad Rzeszutek Wilk
      Add a mitigation mode parameter "vmentry_l1d_flush" for CVE-2018-3620, aka
      L1 terminal fault. The valid arguments are:
      
       - "always" 	L1D cache flush on every VMENTER.
       - "cond"	Conditional L1D cache flush, explained below
       - "never"	Disable the L1D cache flush mitigation
      
      "cond" is trying to avoid L1D cache flushes on VMENTER if the code executed
      between VMEXIT and VMENTER is considered safe, i.e. is not bringing any
      interesting information into L1D which might exploited.
      
      [ tglx: Split out from a larger patch ]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
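      A sketch of wiring up a string-valued module parameter with validation
      (module_param_cb() is the standard mechanism; the ops and handler names
      are assumptions):

        #include <linux/moduleparam.h>

        static int vmentry_l1d_flush_set(const char *s, const struct kernel_param *kp);
        static int vmentry_l1d_flush_get(char *s, const struct kernel_param *kp);

        static const struct kernel_param_ops vmentry_l1d_flush_ops = {
                .set = vmentry_l1d_flush_set,   /* accepts "always", "cond" or "never" */
                .get = vmentry_l1d_flush_get,
        };
        module_param_cb(vmentry_l1d_flush, &vmentry_l1d_flush_ops, NULL, 0644);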
    • x86/KVM: Warn user if KVM is loaded SMT and L1TF CPU bug being present · 26acfb66
      Committed by Konrad Rzeszutek Wilk
      If the L1TF CPU bug is present we allow the KVM module to be loaded, as the
      majority of users that use Linux and KVM have trusted guests and do not want a
      broken setup.
      
      Cloud vendors are the ones that are uncomfortable with CVE-2018-3620 and as
      such they are the ones that should set 'nosmt' to one.
      
      Setting 'nosmt' means that the system administrator also needs to disable
      SMT (Hyper-threading) in the BIOS, or via the 'nosmt' command line
      parameter, or via the /sys/devices/system/cpu/smt/control. See commit
      05736e4a ("cpu/hotplug: Provide knobs to control SMT").
      
      Other mitigations are to use task affinity, cpu sets, interrupt binding,
      etc - anything to make sure that _only_ the same guests vCPUs are running
      on sibling threads.
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  23. 04 Jul 2018 (1 commit)
    • usercopy: Allow boot cmdline disabling of hardening · b5cb15d9
      Committed by Chris von Recklinghausen
      Enabling HARDENED_USERCOPY may cause measurable regressions in networking
      performance: up to 8% under UDP flood.
      
      I ran a small packet UDP flood using pktgen vs. a host b2b connected. On
      the receiver side the UDP packets are processed by a simple user space
      process that just reads and drops them:
      
      https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
      
      Not very useful from a functional PoV, but it helps to pin-point
      bottlenecks in the networking stack.
      
      When running a kernel with CONFIG_HARDENED_USERCOPY=y, I see a 5-8%
      regression in the receive tput, compared to the same kernel without this
      option enabled.
      
      With CONFIG_HARDENED_USERCOPY=y, perf shows ~6% of CPU time spent
      cumulatively in __check_object_size (~4%) and __virt_addr_valid (~2%).
      
      The call-chain is:
      
      __GI___libc_recvfrom
      entry_SYSCALL_64_after_hwframe
      do_syscall_64
      __x64_sys_recvfrom
      __sys_recvfrom
      inet_recvmsg
      udp_recvmsg
      __check_object_size
      
      udp_recvmsg() actually calls copy_to_iter() (inlined) and the latter
      calls check_copy_size() (again, inlined).
      
      A generic distro may want to enable HARDENED_USERCOPY in their default
      kernel config, but at the same time, such a distro may want to be able to
      avoid the performance penalties of the default configuration and
      disable the stricter check on a per-boot basis.
      
      This change adds a boot parameter that conditionally disables
      HARDENED_USERCOPY via "hardened_usercopy=off".
      Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
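      A sketch of the boot-time switch (close in spirit to the changelog; the
      static-key name and the exact wiring are assumptions):

        #include <linux/init.h>
        #include <linux/jump_label.h>
        #include <linux/string.h>

        DEFINE_STATIC_KEY_FALSE(bypass_usercopy_checks);

        static bool disable_checks __initdata;

        /* "hardened_usercopy=off" skips the object-size checks in copy_{to,from}_user(). */
        static int __init parse_hardened_usercopy(char *str)
        {
                if (str && !strcmp(str, "off"))
                        disable_checks = true;
                return 1;
        }
        __setup("hardened_usercopy", parse_hardened_usercopy);

        static int __init set_hardened_usercopy(void)
        {
                if (disable_checks)
                        static_branch_enable(&bypass_usercopy_checks);
                return 0;
        }
        late_initcall(set_hardened_usercopy);

      The hot path can then test static_branch_unlikely(&bypass_usercopy_checks)
      before doing the expensive object-size validation.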
  24. 02 Jul 2018 (1 commit)
    • Revert "x86/apic: Ignore secondary threads if nosmt=force" · 506a66f3
      Committed by Thomas Gleixner
      Dave Hansen reported, that it's outright dangerous to keep SMT siblings
      disabled completely so they are stuck in the BIOS and wait for SIPI.
      
      The reason is that Machine Check Exceptions are broadcast to siblings and
      the soft-disabled sibling has CR4.MCE = 0. If an MCE is delivered to a
      logical core with CR4.MCE = 0, it asserts IERR#, which shuts down or
      reboots the machine. The MCE chapter in the SDM contains the following
      blurb:
      
          Because the logical processors within a physical package are tightly
          coupled with respect to shared hardware resources, both logical
          processors are notified of machine check errors that occur within a
          given physical processor. If machine-check exceptions are enabled when
          a fatal error is reported, all the logical processors within a physical
          package are dispatched to the machine-check exception handler. If
          machine-check exceptions are disabled, the logical processors enter the
          shutdown state and assert the IERR# signal. When enabling machine-check
          exceptions, the MCE flag in control register CR4 should be set for each
          logical processor.
      
      Reverting the commit which ignores siblings at enumeration time solves only
      half of the problem. The core cpuhotplug logic needs to be adjusted as
      well.
      
      This thoughtfully engineered mechanism also turns the boot process on all
      Intel HT enabled systems into an MCE lottery. MCE is enabled on the boot CPU
      before the secondary CPUs are brought up. Depending on the number of
      physical cores the window in which this situation can happen is smaller or
      larger. On a HSW-EX it's about 750ms:
      
      MCE is enabled on the boot CPU:
      
      [    0.244017] mce: CPU supports 22 MCE banks
      
      The corresponding sibling #72 boots:
      
      [    1.008005] .... node  #0, CPUs:    #72
      
      That means if an MCE hits on physical core 0 (logical CPUs 0 and 72)
      between these two points the machine is going to shutdown. At least it's a
      known safe state.
      
      It's obvious that the early boot can be hit by an MCE as well and then runs
      into the same situation because MCEs are not yet enabled on the boot CPU.
      But after enabling them on the boot CPU, it does not make any sense to
      prevent the kernel from recovering.
      
      Adjust the nosmt kernel parameter documentation as well.
      
      Reverts: 2207def7 ("x86/apic: Ignore secondary threads if nosmt=force")
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Tony Luck <tony.luck@intel.com>
  25. 30 Jun 2018 (1 commit)
  26. 21 Jun 2018 (1 commit)
    • cpu/hotplug: Provide knobs to control SMT · 05736e4a
      Committed by Thomas Gleixner
      Provide a command line and a sysfs knob to control SMT.
      
      The command line options are:
      
       'nosmt':	Enumerate secondary threads, but do not online them
       		
       'nosmt=force': Ignore secondary threads completely during enumeration
       		via MP table and ACPI/MADT.
      
      The sysfs control file has the following states (read/write):
      
       'on':		 SMT is enabled. Secondary threads can be freely onlined
       'off':		 SMT is disabled. Secondary threads, even if enumerated
       		 cannot be onlined
       'forceoff':	 SMT is permanently disabled. Writes to the control
       		 file are rejected.
       'notsupported': SMT is not supported by the CPU
      
      The command line option 'nosmt' sets the sysfs control to 'off'. This
      can be changed to 'on' to reenable SMT during runtime.
      
      The command line option 'nosmt=force' sets the sysfs control to
      'forceoff'. This cannot be changed during runtime.
      
      When SMT is 'on' and the control file is changed to 'off' then all online
      secondary threads are offlined and attempts to online a secondary thread
      later on are rejected.
      
      When SMT is 'off' and the control file is changed to 'on' then secondary
      threads can be onlined again. The 'off' -> 'on' transition does not
      automatically online the secondary threads.
      
      When the control file is set to 'forceoff', the behaviour is the same as
      setting it to 'off', but the operation is irreversible and later writes to
      the control file are rejected.
      
      When the control status is 'notsupported' then writes to the control file
      are rejected.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
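      A sketch of the command line handling (the enum and variable are
      assumptions consistent with the states listed above):

        #include <linux/init.h>
        #include <linux/string.h>

        enum cpuhp_smt_control {
                CPU_SMT_ENABLED,
                CPU_SMT_DISABLED,
                CPU_SMT_FORCE_DISABLED,
                CPU_SMT_NOT_SUPPORTED,
        };

        static enum cpuhp_smt_control cpu_smt_control = CPU_SMT_ENABLED;

        /* "nosmt" maps to the sysfs 'off' state, "nosmt=force" to 'forceoff'. */
        static int __init smt_cmdline_disable(char *arg)
        {
                cpu_smt_control = CPU_SMT_DISABLED;
                if (arg && !strcmp(arg, "force"))
                        cpu_smt_control = CPU_SMT_FORCE_DISABLED;
                return 0;
        }
        early_param("nosmt", smt_cmdline_disable);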
  27. 16 Jun 2018 (2 commits)
  28. 01 Jun 2018 (1 commit)
  29. 29 May 2018 (1 commit)
  30. 28 May 2018 (1 commit)