1. 24 March 2020, 1 commit
  2. 14 March 2020, 1 commit
  3. 11 March 2020, 1 commit
  4. 06 March 2020, 1 commit
  5. 05 March 2020, 2 commits
  6. 03 March 2020, 2 commits
  7. 25 February 2020, 1 commit
  8. 21 February 2020, 4 commits
    • torture: Allow disabling of boottime CPU-hotplug torture operations · 8171d3e0
      Paul E. McKenney authored
      In theory, RCU-hotplug operations are supposed to work as soon as there
      is more than one CPU online.  However, in practice, in normal production
      there is no way to make them happen until userspace is up and running.
      Besides which, on smaller systems, rcutorture doesn't start doing hotplug
      operations until 30 seconds after the start of boot, which on most
      systems also means the better part of 30 seconds after the end of boot.
      This commit therefore provides a new torture.disable_onoff_at_boot kernel
      boot parameter that suppresses CPU-hotplug torture operations until
      about the time that init is spawned.
      
      Of course, if you know of a need for boottime CPU-hotplug operations,
      then you should avoid passing this argument to any of the torture tests.
      You might also want to look at the splats linked to below.
      
      Link: https://lore.kernel.org/lkml/20191206185208.GA25636@paulmck-ThinkPad-P72/
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
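      As an editorial illustration (not part of the commit message), suppressing
      boottime hotplug torture is a matter of appending the new boolean parameter
      to the kernel command line alongside the usual rcutorture options; the "=1"
      syntax for setting the boolean is an assumption:

        torture.disable_onoff_at_boot=1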
    • rcutorture: Allow boottime stall warnings to be suppressed · 58c53360
      Paul E. McKenney authored
      In normal production, an RCU CPU stall warning at boottime is often
      just as bad as at any other time.  In fact, given the desire for fast
      boot, any sort of long-term stall at boot is a bad idea.  However,
      heavy rcutorture testing on large hyperthreaded systems can generate
      boottime RCU CPU stalls as a matter of course.  This commit therefore
      provides a kernel boot parameter that suppresses reporting of boottime
      RCU CPU stall warnings and similarly of rcutorture writer stalls.
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
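      Editorial note: the commit message above does not spell out the parameter
      name; in mainline the suppression is controlled by the boot parameter shown
      below, so treat the exact name as an assumption here:

        rcupdate.rcu_cpu_stall_suppress_at_boot=1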
    • rcu: React to callback overload by aggressively seeking quiescent states · b2b00ddf
      Paul E. McKenney authored
      In default configurations, RCU currently waits at least 100 milliseconds
      before asking cond_resched() and/or resched_cpu() for help seeking
      quiescent states to end a grace period.  But 100 milliseconds can be
      one good long time during an RCU callback flood, for example, as can
      happen when user processes repeatedly open and close files in a tight
      loop.  These 100-millisecond gaps in successive grace periods during a
      callback flood can result in excessive numbers of callbacks piling up,
      unnecessarily increasing memory footprint.
      
      This commit therefore asks cond_resched() and/or resched_cpu() for help
      as early as the first FQS scan when at least one of the CPUs has more
      than 20,000 callbacks queued, a number that can be changed using the new
      rcutree.qovld kernel boot parameter.  An auxiliary qovld_calc variable
      is used to avoid acquisition of locks that have not yet been initialized.
      Early tests indicate that this reduces the RCU-callback memory footprint
      during rcutorture floods by anywhere from 50% to 4x, depending on configuration.
      Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Reported-by: Tejun Heo <tj@kernel.org>
      [ paulmck: Fix bug located by Qian Cai. ]
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Tested-by: Dexuan Cui <decui@microsoft.com>
      Tested-by: Qian Cai <cai@lca.pw>
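      For illustration (not from the commit message), lowering the overload
      threshold so the first FQS scan asks for help at 10,000 queued callbacks
      instead of the default 20,000 uses the new boot parameter; the value chosen
      here is arbitrary:

        rcutree.qovld=10000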
    • x86/split_lock: Enable split lock detection by kernel · 6650cdd9
      Peter Zijlstra (Intel) authored
      A split-lock occurs when an atomic instruction operates on data that spans
      two cache lines. In order to maintain atomicity the core takes a global bus
      lock.
      
      This is typically >1000 cycles slower than an atomic operation within a
      cache line. It also disrupts performance on other cores (which must wait
      for the bus lock to be released before their memory operations can
      complete). For real-time systems this may mean missing deadlines. For other
      systems it may just be very annoying.
      
      Some CPUs have the capability to raise an #AC trap when a split lock is
      attempted.
      
      Provide a command line option to give the user choices on how to handle
      this:
      
      split_lock_detect=
      	off	- not enabled (no traps for split locks)
      	warn	- warn once when an application does a
      		  split lock, but allow it to continue
      		  running.
      	fatal	- Send SIGBUS to applications that cause split lock
      
      On systems that support split lock detection the default is "warn". Note
      that if the kernel hits a split lock in any mode other than "off" it will
      oops.
      
      One implementation wrinkle is that the MSR to control the split lock
      detection is per-core, not per-thread. This might result in some short-lived
      races on HT systems in "warn" mode if Linux tries to enable it on one
      thread while disabling it on the other. Race analysis by Sean Christopherson:
      
        - Toggling of split-lock is only done in "warn" mode.  Worst case
          scenario of a race is that a misbehaving task will generate multiple
          #AC exceptions on the same instruction.  And this race will only occur
          if both siblings are running tasks that generate split-lock #ACs, e.g.
          a race where sibling threads are writing different values will only
          occur if CPUx is disabling split-lock after an #AC and CPUy is
          re-enabling split-lock after *its* previous task generated an #AC.
        - Transitioning between off/warn/fatal modes at runtime isn't supported
          and disabling is tracked per task, so hardware will always reach a steady
          state that matches the configured mode.  I.e. split-lock is guaranteed to
          be enabled in hardware once all _TIF_SLD threads have been scheduled out.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
      Co-developed-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lore.kernel.org/r/20200126200535.GB30377@agluck-desk2.amr.corp.intel.com
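      To make the failure mode concrete, here is an editorial user-space sketch
      (not from the commit): a locked read-modify-write on a 4-byte value that is
      deliberately placed so it straddles a 64-byte cache-line boundary. With
      split_lock_detect=warn such an access is expected to be logged once; with
      split_lock_detect=fatal the process receives SIGBUS. The buffer layout is
      an assumption chosen only to force the misalignment:

        /* gcc -O2 split_lock_demo.c -o split_lock_demo */
        #include <stdio.h>

        static char buf[128] __attribute__((aligned(64)));

        int main(void)
        {
                /* Point 2 bytes before a 64-byte boundary so the 4-byte value
                 * covers bytes 62..65 and spans two cache lines. */
                int *p = (int *)(buf + 62);

                /* Atomic RMW on a value spanning two cache lines: a split lock. */
                __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST);

                printf("value = %d\n", *p);
                return 0;
        }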
  9. 19 February 2020, 1 commit
  10. 10 February 2020, 2 commits
  11. 05 February 2020, 1 commit
  12. 01 February 2020, 1 commit
  13. 25 January 2020, 2 commits
    • rcu: Remove kfree_call_rcu_nobatch() · 189a6883
      Joel Fernandes (Google) authored
      Now that the kfree_rcu() special-casing has been removed from tree RCU,
      this commit removes kfree_call_rcu_nobatch() since it is no longer needed.
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcuperf: Add kfree_rcu() performance Tests · e6e78b00
      Joel Fernandes (Google) authored
      This test runs kfree_rcu() in a loop to measure performance of the new
      kfree_rcu() batching functionality.
      
      The following table shows results when booting with arguments:
      rcuperf.kfree_loops=20000 rcuperf.kfree_alloc_num=8000
      rcuperf.kfree_rcu_test=1 rcuperf.kfree_no_batch=X
      
      rcuperf.kfree_no_batch=X    # Grace Periods	Test Duration (s)
        X=1 (old behavior)              9133                 11.5
        X=0 (new behavior)              1732                 12.5
      
      On a 16 CPU system with the above boot parameters, we see that the total
      number of grace periods that elapse during the test drops from 9133 when
      not batching to 1732 when batching (a 5X improvement). The kfree_rcu()
      flood itself slows down a bit when batching, though, as shown.
      
      Note that the active memory consumption during the kfree_rcu() flood
      does increase to around 200-250MB due to the batching (from around 50MB
      without batching). However, this memory consumption is relatively
      constant. In other words, the system is able to keep up with the
      kfree_rcu() load. The memory consumption comes down considerably if
      KFREE_DRAIN_JIFFIES is increased from HZ/50 to HZ/80. A later patch will
      reduce memory consumption further by using multiple lists.
      
      Also, when running the test, please disable CONFIG_DEBUG_PREEMPT and
      CONFIG_PROVE_RCU for realistic comparisons with/without batching.
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
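      As an editorial aside, the call pattern the test floods is the two-argument
      kfree_rcu() API; a minimal kernel-side sketch, with the struct and helper
      names being illustrative only:

        #include <linux/rcupdate.h>
        #include <linux/slab.h>

        struct foo {
                int data;
                struct rcu_head rh;
        };

        static void drop_foo(struct foo *p)
        {
                /* Queue p to be kfree()d after a grace period; with batching,
                 * many such requests can share a single grace period. */
                kfree_rcu(p, rh);
        }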
  14. 22 January 2020, 1 commit
    • genirq, sched/isolation: Isolate from handling managed interrupts · 11ea68f5
      Ming Lei authored
      The affinity of managed interrupts is completely handled in the kernel and
      cannot be changed via the /proc/irq/* interfaces from user space. As the
      kernel tries to spread out interrupts evenly across CPUs on x86 to prevent
      vector exhaustion, it can happen that a managed interrupt whose affinity
      mask contains both isolated and housekeeping CPUs is routed to an isolated
      CPU. As a consequence IO submitted on a housekeeping CPU causes interrupts
      on the isolated CPU.
      
      Add a new sub-parameter 'managed_irq' for 'isolcpus' and the corresponding
      logic in the interrupt affinity selection code.
      
      The subparameter indicates to the interrupt affinity selection logic that
      it should try to avoid the above scenario.
      
      This isolation is best effort and only effective if the automatically
      assigned interrupt mask of a device queue contains isolated and
      housekeeping CPUs. If housekeeping CPUs are online then such interrupts are
      directed to the housekeeping CPU so that IO submitted on the housekeeping
      CPU cannot disturb the isolated CPU.
      
      If a queue's affinity mask contains only isolated CPUs then this parameter
      has no effect on the interrupt routing decision, though interrupts only
      happen when tasks running on those isolated CPUs submit IO. IO submitted
      on housekeeping CPUs has no influence on those queues.
      
      If the affinity mask contains both housekeeping and isolated CPUs, but none
      of the contained housekeeping CPUs is online, then the interrupt is also
      routed to an isolated CPU. Interrupts are only delivered when one of the
      isolated CPUs in the affinity mask submits IO. If one of the contained
      housekeeping CPUs comes online, the CPU hotplug logic migrates the
      interrupt automatically back to the upcoming housekeeping CPU. Depending on
      the type of interrupt controller, this can require that at least one
      interrupt is delivered to the isolated CPU in order to complete the
      migration.
      
      [ tglx: Removed unused parameter, added and edited comments/documentation
        	and rephrased the changelog so it contains more details. ]
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20200120091625.17912-1-ming.lei@redhat.com
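      For illustration (not part of the commit message), keeping CPUs 2-7 out of
      both scheduler-domain balancing and managed-interrupt handling could look
      like the following; the CPU list and the combination with the existing
      'domain' flag are assumptions:

        isolcpus=domain,managed_irq,2-7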
  15. 20 January 2020, 1 commit
    • efi/x86: Limit EFI old memory map to SGI UV machines · 1f299fad
      Ard Biesheuvel authored
      We carry a quirk in the x86 EFI code to switch back to an older
      method of mapping the EFI runtime services memory regions, because
      it was deemed risky at the time to implement a new method without
      providing a fallback to the old method in case problems arose.
      
      Such problems did arise, but they appear to be limited to SGI UV1
      machines, and so these are the only ones for which the fallback gets
      enabled automatically (via a DMI quirk). The fallback can be enabled
      manually as well, by passing efi=old_map, but there is very little
      evidence that suggests that this is something that is being relied
      upon in the field.
      
      Given that UV1 support is not enabled by default by the distros
      (Ubuntu, Fedora), there is no point in carrying this fallback code
      all the time if there are no other users. So let's move it into the
      UV support code, and document that efi=old_map now requires this
      support code to be enabled.
      
      Note that efi=old_map has been used in the past on other SGI UV
      machines to work around kernel regressions in production, so we
      keep the option to enable it by hand, but only if the kernel was
      built with UV support.
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20200113172245.27925-8-ardb@kernel.org
  16. 11 January 2020, 1 commit
    • efi: Allow disabling PCI busmastering on bridges during boot · 4444f854
      Matthew Garrett authored
      Add an option to disable the busmaster bit in the control register on
      all PCI bridges before calling ExitBootServices() and passing control
      to the runtime kernel. System firmware may configure the IOMMU to prevent
      malicious PCI devices from being able to attack the OS via DMA. However,
      since firmware can't guarantee that the OS is IOMMU-aware, it will tear
      down IOMMU configuration when ExitBootServices() is called. This leaves
      a window in which a hostile device could still cause damage before
      Linux configures the IOMMU again.
      
      If CONFIG_EFI_DISABLE_PCI_DMA is enabled or "efi=disable_early_pci_dma"
      is passed on the command line, the EFI stub will clear the busmaster bit
      on all PCI bridges before ExitBootServices() is called. This will
      prevent any malicious PCI devices from being able to perform DMA until
      the kernel reenables busmastering after configuring the IOMMU.
      
      This option may cause failures with some poorly behaved hardware and
      should not be enabled without testing. The kernel commandline options
      "efi=disable_early_pci_dma" or "efi=no_disable_early_pci_dma" may be
      used to override the default. Note that PCI devices downstream from PCI
      bridges are disconnected from their drivers first, using the UEFI
      driver model API, so that DMA can be disabled safely at the bridge
      level.
      
      [ardb: disconnect PCI I/O handles first, as suggested by Arvind]
      Co-developed-by: Matthew Garrett <mjg59@google.com>
      Signed-off-by: Matthew Garrett <mjg59@google.com>
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arvind Sankar <nivedita@alum.mit.edu>
      Cc: Matthew Garrett <matthewgarrett@google.com>
      Cc: linux-efi@vger.kernel.org
      Link: https://lkml.kernel.org/r/20200103113953.9571-18-ardb@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
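      As an editorial example (both option strings appear in the commit message
      above): a kernel built with CONFIG_EFI_DISABLE_PCI_DMA=y that misbehaves on
      a particular machine can have the mitigation switched off for one boot from
      the command line:

        efi=no_disable_early_pci_dma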
  17. 08 January 2020, 1 commit
    • Documentation,selinux: fix references to old selinuxfs mount point · d41415eb
      Stephen Smalley authored
      selinuxfs was originally mounted on /selinux, and various docs and
      kconfig help texts referred to nodes under it.  In Linux 3.0,
      /sys/fs/selinux was introduced as the preferred mount point for selinuxfs.
      Update all the old references from /selinux/ to /sys/fs/selinux/.
      While we are there, update the description of the selinux boot parameter
      to reflect the fact that the default value is always 1 since
      commit be6ec88f ("selinux: Remove SECURITY_SELINUX_BOOTPARAM_VALUE")
      and drop discussion of runtime disable since it is deprecated.
      Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
  18. 23 November 2019, 1 commit
  19. 19 November 2019, 1 commit
    • ACPI: sysfs: Change ACPI_MASKABLE_GPE_MAX to 0x100 · a7583e72
      Yunfeng Ye authored
      The commit 0f27cff8 ("ACPI: sysfs: Make ACPI GPE mask kernel
      parameter cover all GPEs") says:
        "Use a bitmap of size 0xFF instead of a u64 for the GPE mask so 256
         GPEs can be masked"
      
      But the masking of GPE 0xFF is not supported, and the check condition
      "gpe > ACPI_MASKABLE_GPE_MAX" is not valid because the type of gpe is
      u8.
      
      So modify the macro ACPI_MASKABLE_GPE_MAX to 0x100, and drop the "gpe >
      ACPI_MASKABLE_GPE_MAX" check. In addition, update the docs "Format" for
      acpi_mask_gpe parameter.
      
      Fixes: 0f27cff8 ("ACPI: sysfs: Make ACPI GPE mask kernel parameter cover all GPEs")
      Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
      [ rjw: Use u16 as gpe data type in acpi_gpe_apply_masked_gpes() ]
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  20. 18 November 2019, 1 commit
  21. 16 November 2019, 1 commit
    • x86/speculation: Fix incorrect MDS/TAA mitigation status · 64870ed1
      Waiman Long authored
      For MDS vulnerable processors with TSX support, enabling either MDS or
      TAA mitigations will enable the use of VERW to flush internal processor
      buffers at the right code path. IOW, they are either both mitigated
      or both not. However, if the command line options are inconsistent,
      the vulnerabilities sysfs files may not report the mitigation status
      correctly.
      
      For example, with only the "mds=off" option:
      
        vulnerabilities/mds:Vulnerable; SMT vulnerable
        vulnerabilities/tsx_async_abort:Mitigation: Clear CPU buffers; SMT vulnerable
      
      The mds vulnerabilities file has the wrong status in this case. Similarly,
      the taa vulnerability file will be wrong with mds mitigation on, but
      taa off.
      
      Change taa_select_mitigation() to sync up the two mitigation statuses
      and have them turned off if both "mds=off" and "tsx_async_abort=off"
      are present.
      
      Update the documentation to emphasize that, on processors affected by both
      TAA and MDS, "mds=off" and "tsx_async_abort=off" have to be specified
      together to take effect.
      
       [ bp: Massage and add kernel-parameters.txt change too. ]
      
      Fixes: 1b42f017 ("x86/speculation/taa: Add mitigation for TSX Async Abort")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: linux-doc@vger.kernel.org
      Cc: Mark Gross <mgross@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20191115161445.30809-2-longman@redhat.com
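      As an editorial illustration (both switches are quoted in the commit message
      above), fully opting out on a processor affected by both issues requires
      passing the two options together on the kernel command line:

        mds=off tsx_async_abort=off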
  22. 13 November 2019, 1 commit
  23. 07 November 2019, 2 commits
    • x86/efi: Add efi_fake_mem support for EFI_MEMORY_SP · 199c8471
      Dan Williams authored
      Given that EFI_MEMORY_SP is a platform BIOS policy decision for marking
      memory ranges as "reserved for a specific purpose" there will inevitably
      be scenarios where the BIOS omits the attribute in situations where it
      is desired. Unlike other attributes if the OS wants to reserve this
      memory from the kernel the reservation needs to happen early in init. So
      early, in fact, that it needs to happen before e820__memblock_setup()
      which is a prerequisite for efi_fake_memmap(), which wants to allocate
      memory for the updated table.
      
      Introduce an x86 specific efi_fake_memmap_early() that can search for
      attempts to set EFI_MEMORY_SP via efi_fake_mem and update the e820 table
      accordingly.
      
      The KASLR code that scans the command line looking for user-directed
      memory reservations also needs to be updated to consider
      "efi_fake_mem=nn@ss:0x40000" requests.
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
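      As an illustration of the "efi_fake_mem=nn@ss:0x40000" format quoted above
      (the size and start address here are made up), marking 4 GiB of memory
      starting at the 4 GiB physical boundary as EFI_MEMORY_SP would look like:

        efi_fake_mem=4G@0x100000000:0x40000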
    • efi: Common enable/disable infrastructure for EFI soft reservation · b617c526
      Dan Williams authored
      UEFI 2.8 defines an EFI_MEMORY_SP attribute bit to augment the
      interpretation of the EFI Memory Types as "reserved for a specific
      purpose".
      
      The proposed Linux behavior for specific purpose memory is that it is
      reserved for direct-access (device-dax) by default and not available for
      any kernel usage, not even as an OOM fallback.  Later, through udev
      scripts or another init mechanism, these device-dax claimed ranges can
      be reconfigured and hot-added to the available System-RAM with a unique
      node identifier. This device-dax management scheme implements "soft" in
      the "soft reserved" designation by allowing some or all of the
      reservation to be recovered as typical memory. This policy can be
      disabled at compile-time with CONFIG_EFI_SOFT_RESERVE=n, or runtime with
      efi=nosoftreserve.
      
      As for this patch, define the common helpers to determine if the
      EFI_MEMORY_SP attribute should be honored. The determination needs to be
      made early to prevent the kernel from being loaded into soft-reserved
      memory, or otherwise allowing early allocations to land there. Follow-on
      changes are needed per architecture to leverage these helpers in their
      respective mem-init paths.
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  24. 05 November 2019, 1 commit
  25. 04 November 2019, 1 commit
    • kvm: mmu: ITLB_MULTIHIT mitigation · b8e8c830
      Paolo Bonzini authored
      With some Intel processors, putting the same virtual address in the TLB
      as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit
      and cause the processor to issue a machine check resulting in a CPU lockup.
      
      Unfortunately when EPT page tables use huge pages, it is possible for a
      malicious guest to cause this situation.
      
      Add a knob to mark huge pages as non-executable. When the nx_huge_pages
      parameter is enabled (and we are using EPT), all huge pages are marked as
      NX. If the guest attempts to execute in one of those pages, the page is
      broken down into 4K pages, which are then marked executable.
      
      This is not an issue for shadow paging (except nested EPT), because then
      the host is in control of TLB flushes and the problematic situation cannot
      happen.  With nested EPT, the nested guest can again cause problems, so
      shadow and direct EPT are treated in the same way.
      
      [ tglx: Fixup default to auto and massage wording a bit ]
      Originally-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
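      For illustration (the parameter name appears in the commit message above;
      the value string is an assumption about the module-parameter interface),
      disabling the mitigation on a host that runs only trusted guests would look
      like:

        kvm.nx_huge_pages=off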
  26. 28 October 2019, 3 commits
  27. 26 October 2019, 1 commit
  28. 23 October 2019, 1 commit
  29. 22 October 2019, 1 commit
    • arm64: Retrieve stolen time as paravirtualized guest · e0685fa2
      Steven Price authored
      Enable paravirtualization features when running under a hypervisor
      supporting the PV_TIME_ST hypercall.
      
      For each (v)CPU, we ask the hypervisor for the location of a shared
      page which the hypervisor will use to report stolen time to us. We set
      pv_time_ops to the stolen time function which simply reads the stolen
      value from the shared page for a VCPU. We guarantee single-copy
      atomicity using READ_ONCE, which means we can also read the stolen
      time for a VCPU other than the currently running one while it is
      potentially being updated by the hypervisor.
      Signed-off-by: Steven Price <steven.price@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
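      As an editorial sketch of the read side described above (the structure
      fields follow the Arm PV time shared-page description of revision,
      attributes, and stolen time, but the struct and function names here are
      illustrative, not the exact kernel symbols):

        #include <linux/compiler.h>     /* READ_ONCE() */
        #include <linux/types.h>        /* __le32, __le64, u64 */
        #include <asm/byteorder.h>      /* le64_to_cpu() */

        struct pv_time_region {
                __le32 revision;
                __le32 attributes;
                __le64 stolen_time;     /* nanoseconds stolen from this vCPU */
        };

        static u64 read_stolen_ns(struct pv_time_region *reg)
        {
                /* READ_ONCE() gives single-copy atomicity, so the value can be
                 * read even while the hypervisor is concurrently updating it. */
                return le64_to_cpu(READ_ONCE(reg->stolen_time));
        }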
  30. 16 October 2019, 1 commit