1. 22 Jan, 2020 · 1 commit
    • genirq, sched/isolation: Isolate from handling managed interrupts · 11ea68f5
      Authored by Ming Lei
      The affinity of managed interrupts is completely handled in the kernel and
      cannot be changed via the /proc/irq/* interfaces from user space. As the
      kernel tries to spread out interrupts evenly across CPUs on x86 to prevent
      vector exhaustion, it can happen that a managed interrupt whose affinity
      mask contains both isolated and housekeeping CPUs is routed to an isolated
      CPU. As a consequence IO submitted on a housekeeping CPU causes interrupts
      on the isolated CPU.
      
      Add a new sub-parameter 'managed_irq' for 'isolcpus' and the corresponding
      logic in the interrupt affinity selection code.
      
      The sub-parameter indicates to the interrupt affinity selection logic that
      it should try to avoid the above scenario.
      
      This isolation is best effort and only effective if the automatically
      assigned interrupt mask of a device queue contains isolated and
      housekeeping CPUs. If housekeeping CPUs are online then such interrupts are
      directed to the housekeeping CPU so that IO submitted on the housekeeping
      CPU cannot disturb the isolated CPU.
      
      If a queue's affinity mask contains only isolated CPUs then this parameter
      has no effect on the interrupt routing decision, though interrupts only
      happen when tasks running on those isolated CPUs submit IO. IO submitted
      on housekeeping CPUs has no influence on those queues.
      
      If the affinity mask contains both housekeeping and isolated CPUs, but none
      of the contained housekeeping CPUs is online, then the interrupt is also
      routed to an isolated CPU. Interrupts are only delivered when one of the
      isolated CPUs in the affinity mask submits IO. If one of the contained
      housekeeping CPUs comes online, the CPU hotplug logic migrates the
      interrupt automatically back to the upcoming housekeeping CPU. Depending on
      the type of interrupt controller, this can require that at least one
      interrupt is delivered to the isolated CPU in order to complete the
      migration.
      
      [ tglx: Removed unused parameter, added and edited comments/documentation
        and rephrased the changelog so it contains more details. ]
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20200120091625.17912-1-ming.lei@redhat.com
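      As a rough illustration of the selection rule described above, the sketch
      below derives the effective target mask from plain bitmasks. It is a
      standalone, hypothetical simulation of the behaviour, not the kernel's
      interrupt affinity code; the function and variable names are made up.

        #include <stdio.h>

        /* Toy CPU masks as bitmaps: bit n set means CPU n is in the set. */
        typedef unsigned long cpumask_t;

        /*
         * Prefer the online housekeeping CPUs contained in the queue's
         * affinity mask; fall back to the full mask when there are none.
         */
        static cpumask_t pick_effective_mask(cpumask_t affinity,
                                             cpumask_t housekeeping,
                                             cpumask_t online)
        {
            cpumask_t hk = affinity & housekeeping & online;

            return hk ? hk : affinity;
        }

        int main(void)
        {
            cpumask_t online       = 0x0f;  /* CPUs 0-3 online           */
            cpumask_t housekeeping = 0x03;  /* CPUs 0-1 are housekeeping */
            cpumask_t mixed_q      = 0x06;  /* queue mapped to CPUs 1-2  */
            cpumask_t isolated_q   = 0x0c;  /* queue mapped to CPUs 2-3  */

            /* Mixed queue: the interrupt is steered to housekeeping CPU 1. */
            printf("mixed queue    -> %#lx\n",
                   pick_effective_mask(mixed_q, housekeeping, online));

            /* Isolated-only queue: managed_irq has no effect, CPUs 2-3 stay. */
            printf("isolated queue -> %#lx\n",
                   pick_effective_mask(isolated_q, housekeeping, online));
            return 0;
        }

      On the command line the behaviour would be requested together with the
      existing isolcpus flags, e.g. isolcpus=managed_irq,domain,2-3 (CPU list
      chosen purely for illustration).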
  2. 22 Dec, 2019 · 1 commit
  3. 12 Dec, 2019 · 1 commit
  4. 10 Dec, 2019 · 1 commit
  5. 01 Dec, 2019 · 2 commits
  6. 26 Nov, 2019 · 1 commit
  7. 23 Nov, 2019 · 1 commit
  8. 22 Nov, 2019 · 1 commit
    • block: add iostat counters for flush requests · b6866318
      Authored by Konstantin Khlebnikov
      Requests that trigger flushing the volatile writeback cache to disk
      (barriers) have a significant effect on overall performance.

      The block layer has a sophisticated engine for combining several flush
      requests into one, but there are no statistics for the actual flushes
      executed by the disk. Requests which trigger flushes are usually
      barriers - zero-size writes.
      
      This patch adds two iostat counters to /sys/class/block/$dev/stat and
      /proc/diskstats - the count of completed flush requests and their total
      time. A small read-side sketch follows at the end of this entry.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
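      A minimal userspace sketch of reading the new counters back. It assumes
      the two flush fields are the last two of the 17 whitespace-separated
      values in the stat file (completed flush requests and time spent
      flushing, in milliseconds), and uses "sda" purely as an example device.

        #include <stdio.h>

        int main(void)
        {
            unsigned long long f[17] = { 0 };
            FILE *fp = fopen("/sys/class/block/sda/stat", "r");
            int n = 0;

            if (!fp) {
                perror("open stat");
                return 1;
            }
            /* Parse up to 17 counters; older kernels expose fewer fields. */
            while (n < 17 && fscanf(fp, "%llu", &f[n]) == 1)
                n++;
            fclose(fp);

            if (n >= 17)
                printf("flush requests: %llu, flush ms: %llu\n", f[15], f[16]);
            else
                printf("no flush counters exposed (%d fields)\n", n);
            return 0;
        }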
  9. 19 Nov, 2019 · 1 commit
    • ACPI: sysfs: Change ACPI_MASKABLE_GPE_MAX to 0x100 · a7583e72
      Authored by Yunfeng Ye
      The commit 0f27cff8 ("ACPI: sysfs: Make ACPI GPE mask kernel
      parameter cover all GPEs") says:
        "Use a bitmap of size 0xFF instead of a u64 for the GPE mask so 256
         GPEs can be masked"
      
      But masking GPE 0xFF is not supported, and the check condition
      "gpe > ACPI_MASKABLE_GPE_MAX" can never be true because the type of gpe
      is u8.
      
      So modify the macro ACPI_MASKABLE_GPE_MAX to 0x100 and drop the "gpe >
      ACPI_MASKABLE_GPE_MAX" check. In addition, update the "Format"
      documentation for the acpi_mask_gpe parameter. The type issue is
      sketched at the end of this entry.
      
      Fixes: 0f27cff8 ("ACPI: sysfs: Make ACPI GPE mask kernel parameter cover all GPEs")
      Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
      [ rjw: Use u16 as gpe data type in acpi_gpe_apply_masked_gpes() ]
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
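      A standalone sketch of the type problem (not the sysfs code itself):
      a u8 GPE number can never exceed 0xFF, so the old range check rejected
      nothing, while a u16 value indexing a 0x100-entry bitmap covers every
      GPE from 0x00 through 0xFF.

        #include <stdio.h>
        #include <stdint.h>

        #define ACPI_MASKABLE_GPE_MAX 0x100  /* bitmap covers GPEs 0x00..0xFF */

        int main(void)
        {
            uint8_t      gpe_old   = 0xFF;   /* old parsing type          */
            uint16_t     gpe_new   = 0xFF;   /* new parsing type          */
            unsigned int old_limit = 0xFF;   /* old ACPI_MASKABLE_GPE_MAX */

            /* The old check compared a u8 against 0xFF: always false. */
            printf("old check fires for 0x%02x: %s\n", gpe_old,
                   gpe_old > old_limit ? "yes" : "no, it never can");

            /* A u16 GPE number and a 0x100-bit map handle the full range. */
            printf("GPE 0x%02x fits a %d-bit map: %s\n", gpe_new,
                   ACPI_MASKABLE_GPE_MAX,
                   gpe_new < ACPI_MASKABLE_GPE_MAX ? "yes" : "no");
            return 0;
        }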
  10. 18 Nov, 2019 · 2 commits
  11. 16 Nov, 2019 · 2 commits
    • x86/speculation: Fix incorrect MDS/TAA mitigation status · 64870ed1
      Authored by Waiman Long
      For MDS-vulnerable processors with TSX support, enabling either the MDS
      or the TAA mitigation will enable the use of VERW to flush internal
      processor buffers at the right code path. IOW, they are either both
      mitigated or both not. However, if the command line options are
      inconsistent, the vulnerabilities sysfs files may not report the
      mitigation status correctly.
      
      For example, with only the "mds=off" option:
      
        vulnerabilities/mds:Vulnerable; SMT vulnerable
        vulnerabilities/tsx_async_abort:Mitigation: Clear CPU buffers; SMT vulnerable
      
      The mds vulnerabilities file reports the wrong status in this case.
      Similarly, the taa vulnerability file will be wrong with the mds
      mitigation on but taa off.
      
      Change taa_select_mitigation() to sync up the two mitigation statuses
      and have them both turned off if both "mds=off" and
      "tsx_async_abort=off" are present.
      
      Update the documentation to emphasize that, on processors affected by
      both TAA and MDS, "mds=off" and "tsx_async_abort=off" have to be
      specified together to be effective. The resulting sysfs status can be
      checked with the sketch at the end of this entry.
      
       [ bp: Massage and add kernel-parameters.txt change too. ]
      
      Fixes: 1b42f017 ("x86/speculation/taa: Add mitigation for TSX Async Abort")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: linux-doc@vger.kernel.org
      Cc: Mark Gross <mgross@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20191115161445.30809-2-longman@redhat.com
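      A small sketch for checking whether the two reports agree after booting
      with a given combination of "mds=" and "tsx_async_abort=" options. It
      only reads the standard vulnerability sysfs files shown in the example
      output above.

        #include <stdio.h>
        #include <string.h>

        static void show(const char *name)
        {
            char path[128], line[256] = "<not present>";
            FILE *fp;

            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/vulnerabilities/%s", name);
            fp = fopen(path, "r");
            if (fp) {
                if (!fgets(line, sizeof(line), fp))
                    strcpy(line, "<read error>");
                fclose(fp);
            }
            line[strcspn(line, "\n")] = '\0';
            printf("%-16s %s\n", name, line);
        }

        int main(void)
        {
            /* On TSX-capable parts both lines should now agree: either both
             * mitigated via VERW, or both vulnerable when mds=off and
             * tsx_async_abort=off are given together. */
            show("mds");
            show("tsx_async_abort");
            return 0;
        }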
    • dm integrity: fix excessive alignment of metadata runs · d537858a
      Authored by Mikulas Patocka
      Metadata runs are supposed to be aligned on a 4k boundary (so that they
      work efficiently with disks with 4k sectors). However, a programming
      bug made them aligned on a 128k boundary instead, and the unused space
      is wasted.
      
      Fix this bug by providing proper 4k alignment. In order to keep
      existing volumes working, we introduce a new flag SB_FLAG_FIXED_PADDING
      - when the flag is clear, we calculate the padding the old way. In
      order to make sure that the old version cannot mount a volume created
      by the new version, we increase the superblock version to 4.
      
      Also, in order not to break old integritysetup, we fix the alignment
      only if the "fix_padding" parameter is present when formatting the
      device. The difference in padding is sketched below.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
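      The alignment arithmetic itself is a one-liner. The sketch below uses an
      illustrative run size in bytes and a generic round-up macro; it is not
      the dm-integrity source, which works in sectors.

        #include <stdio.h>

        /* Round up to a power-of-two boundary, like the kernel's ALIGN(). */
        #define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((unsigned long long)(a) - 1))

        int main(void)
        {
            unsigned long long metadata_run = 6000;   /* illustrative size */

            /* Buggy behaviour: runs were padded out to a 128 KiB boundary. */
            printf("128k-aligned: %llu bytes\n",
                   ALIGN_UP(metadata_run, 128 * 1024));

            /* Intended behaviour with the "fix_padding" format option
             * (SB_FLAG_FIXED_PADDING set, superblock version 4): 4 KiB. */
            printf("4k-aligned:   %llu bytes\n",
                   ALIGN_UP(metadata_run, 4096));
            return 0;
        }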
  12. 13 Nov, 2019 · 1 commit
  13. 10 Nov, 2019 · 1 commit
  14. 08 Nov, 2019 · 2 commits
  15. 07 Nov, 2019 · 2 commits
    • x86/efi: Add efi_fake_mem support for EFI_MEMORY_SP · 199c8471
      Authored by Dan Williams
      Given that EFI_MEMORY_SP is a platform BIOS policy decision for marking
      memory ranges as "reserved for a specific purpose", there will
      inevitably be scenarios where the BIOS omits the attribute in
      situations where it is desired. Unlike other attributes, if the OS
      wants to reserve this memory from the kernel, the reservation needs to
      happen early in init. So early, in fact, that it needs to happen before
      e820__memblock_setup(), which is a prerequisite for efi_fake_memmap(),
      which wants to allocate memory for the updated table.
      
      Introduce an x86 specific efi_fake_memmap_early() that can search for
      attempts to set EFI_MEMORY_SP via efi_fake_mem and update the e820 table
      accordingly.
      
      The KASLR code that scans the command line looking for user-directed
      memory reservations also needs to be updated to consider
      "efi_fake_mem=nn@ss:0x40000" requests.
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
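      A standalone sketch of the attribute test such a scan has to perform.
      EFI_MEMORY_SP is the "specific purpose" attribute bit (0x40000) shown in
      the efi_fake_mem example above; the parsing helper here is hypothetical
      and not the kernel's efi_fake_mem code.

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        #define EFI_MEMORY_SP 0x40000ULL  /* "specific purpose", UEFI 2.8 */

        /* Hypothetical parser for one "nn@ss:attr" item of efi_fake_mem=. */
        static int item_sets_sp(const char *item)
        {
            const char *colon = strrchr(item, ':');

            return colon && (strtoull(colon + 1, NULL, 0) & EFI_MEMORY_SP);
        }

        int main(void)
        {
            /* Example request: mark 4G at 0x100000000 as specific purpose. */
            const char *arg = "4G@0x100000000:0x40000";

            printf("reserve early for device-dax: %s\n",
                   item_sets_sp(arg) ? "yes" : "no");
            return 0;
        }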
    • efi: Common enable/disable infrastructure for EFI soft reservation · b617c526
      Authored by Dan Williams
      UEFI 2.8 defines an EFI_MEMORY_SP attribute bit to augment the
      interpretation of the EFI Memory Types as "reserved for a specific
      purpose".
      
      The proposed Linux behavior for specific purpose memory is that it is
      reserved for direct-access (device-dax) by default and not available for
      any kernel usage, not even as an OOM fallback.  Later, through udev
      scripts or another init mechanism, these device-dax claimed ranges can
      be reconfigured and hot-added to the available System-RAM with a unique
      node identifier. This device-dax management scheme implements "soft" in
      the "soft reserved" designation by allowing some or all of the
      reservation to be recovered as typical memory. This policy can be
      disabled at compile time with CONFIG_EFI_SOFT_RESERVE=n, or at runtime
      with efi=nosoftreserve.
      
      As for this patch, define the common helpers to determine if the
      EFI_MEMORY_SP attribute should be honored. The determination needs to be
      made early to prevent the kernel from being loaded into soft-reserved
      memory, or otherwise allowing early allocations to land there. Follow-on
      changes are needed per architecture to leverage these helpers in their
      respective mem-init paths.
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
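      Conceptually the helpers boil down to a policy switch consulted before
      the kernel places anything in an EFI_MEMORY_SP range. The sketch below
      is a simplified, hypothetical stand-in; the real helper names and
      plumbing differ.

        #include <stdbool.h>
        #include <stdio.h>

        #define EFI_MEMORY_SP 0x40000ULL
        #define EFI_MEMORY_WB 0x8ULL

        /* Default from CONFIG_EFI_SOFT_RESERVE, overridden at runtime by
         * the efi=nosoftreserve command-line option (both simplified). */
        static bool soft_reserve_enabled = true;

        static bool range_is_soft_reserved(unsigned long long attr)
        {
            return soft_reserve_enabled && (attr & EFI_MEMORY_SP);
        }

        int main(void)
        {
            unsigned long long attr = EFI_MEMORY_SP | EFI_MEMORY_WB;

            printf("honour reservation:  %d\n", range_is_soft_reserved(attr));

            soft_reserve_enabled = false;      /* efi=nosoftreserve */
            printf("after nosoftreserve: %d\n", range_is_soft_reserved(attr));
            return 0;
        }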
  16. 06 Nov, 2019 · 1 commit
  17. 05 Nov, 2019 · 4 commits
  18. 04 Nov, 2019 · 1 commit
    • kvm: mmu: ITLB_MULTIHIT mitigation · b8e8c830
      Authored by Paolo Bonzini
      With some Intel processors, putting the same virtual address in the TLB
      as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit
      and cause the processor to issue a machine check resulting in a CPU lockup.
      
      Unfortunately when EPT page tables use huge pages, it is possible for a
      malicious guest to cause this situation.
      
      Add a knob to mark huge pages as non-executable. When the nx_huge_pages
      parameter is enabled (and we are using EPT), all huge pages are marked as
      NX. If the guest attempts to execute in one of those pages, the page is
      broken down into 4K pages, which are then marked executable.
      
      This is not an issue for shadow paging (except nested EPT), because
      then the host is in control of TLB flushes and the problematic
      situation cannot happen. With nested EPT, the nested guest can again
      cause problems, so shadow and direct EPT are treated in the same way.
      
      [ tglx: Fixup default to auto and massage wording a bit ]
      Originally-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
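      At its core the knob is a permission tweak applied when a large mapping
      is installed. The sketch below is a heavily simplified, hypothetical
      illustration of that idea, not the KVM MMU code (which also handles
      splitting the page on an execution fault and reclaiming split pages).

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define SPTE_EXEC (1ULL << 2)     /* illustrative "executable" bit */

        static bool nx_huge_pages = true; /* kvm.nx_huge_pages enabled     */

        /* Huge pages lose execute permission, so a guest exec fault lets
         * the host remap the range as 4K executable pages instead. */
        static uint64_t make_spte(bool huge, bool exec)
        {
            uint64_t spte = 0;

            if (exec && !(huge && nx_huge_pages))
                spte |= SPTE_EXEC;
            return spte;
        }

        int main(void)
        {
            printf("2M page executable: %d\n",
                   !!(make_spte(true, true) & SPTE_EXEC));
            printf("4K page executable: %d\n",
                   !!(make_spte(false, true) & SPTE_EXEC));
            return 0;
        }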
  19. 29 Oct, 2019 · 1 commit
  20. 28 Oct, 2019 · 3 commits
  21. 26 Oct, 2019 · 1 commit
  22. 23 Oct, 2019 · 1 commit
  23. 22 Oct, 2019 · 1 commit
    • arm64: Retrieve stolen time as paravirtualized guest · e0685fa2
      Authored by Steven Price
      Enable paravirtualization features when running under a hypervisor
      supporting the PV_TIME_ST hypercall.
      
      For each (v)CPU, we ask the hypervisor for the location of a shared
      page which the hypervisor will use to report stolen time to us. We set
      pv_time_ops to the stolen time function, which simply reads the stolen
      value from the shared page for a VCPU. We guarantee single-copy
      atomicity using READ_ONCE, which means we can also read the stolen
      time for a VCPU other than the currently running one while it is
      potentially being updated by the hypervisor.
      Signed-off-by: Steven Price <steven.price@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
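      A userspace sketch of the read side. The shared-page layout and names
      are simplified stand-ins, not the actual PV_TIME_ST structures; the
      point is the single READ_ONCE-style load of the 64-bit counter, which
      is what makes concurrent hypervisor updates safe to race with.

        #include <stdint.h>
        #include <stdio.h>

        /* Userspace stand-in for the kernel's READ_ONCE(): a volatile load. */
        #define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

        /* Simplified stand-in for the per-vCPU page the hypervisor updates. */
        struct stolen_time_page {
            uint64_t stolen_ns;
        };

        static uint64_t para_steal_clock(const struct stolen_time_page *pg)
        {
            /* Single-copy atomicity: safe to read even while the hypervisor,
             * or a reader on another CPU, accesses the value concurrently. */
            return READ_ONCE(pg->stolen_ns);
        }

        int main(void)
        {
            struct stolen_time_page page = { .stolen_ns = 123456789 };

            printf("stolen time: %llu ns\n",
                   (unsigned long long)para_steal_clock(&page));
            return 0;
        }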
  24. 18 Oct, 2019 · 1 commit
  25. 16 Oct, 2019 · 5 commits
  26. 11 Oct, 2019 · 1 commit