1. 25 Oct, 2021 38 commits
    • T
      x86/fpu: Make init_fpstate correct with optimized XSAVE · d2994844
      Thomas Gleixner committed
      stable inclusion
      from linux-4.19.205
      commit e829367f47218de04587c2df3c4cb5ef87e35648
      
      --------------------------------
      
      commit f9dfb5e3 upstream.
      
      The XSAVE init code initializes all enabled and supported components with
      XRSTOR(S) to init state. Then it XSAVEs the state of the components back
      into init_fpstate which is used in several places to fill in the init state
      of components.
      
      This works correctly with XSAVE, but not with XSAVEOPT and XSAVES because
      those use the init optimization and skip writing state of components which
      are in init state. So init_fpstate.xsave still contains all zeroes after
      this operation.
      
      There are two ways to solve that:
      
         1) Use XSAVE unconditionally, but that requires reshuffling the buffer when
            XSAVES is enabled because XSAVES uses the compacted format.
      
         2) Save the components which are known to have a non-zero init state by other
            means.
      
      Looking deeper, #2 is the right thing to do because all components the
      kernel supports have all-zeroes init state except the legacy features (FP,
      SSE). Those cannot be hard coded because the states are not identical on all
      CPUs, but they can be saved with FXSAVE which avoids all conditionals.
      
      Use FXSAVE to save the legacy FP/SSE components in init_fpstate along with
      a BUILD_BUG_ON() which reminds developers to validate that a newly added
      component has all zeroes init state. As a bonus remove the now unused
      copy_xregs_to_kernel_booting() crutch.
      
      The XSAVE and reshuffle method can still be implemented in the unlikely
      case that components are added which have a non-zero init state and no
      other means to save them. For now, FXSAVE is just simple and good enough.
      
        [ bp: Fix a typo or two in the text. ]
      
      Fixes: 6bad06b7 ("x86, xsave: Use xsaveopt in context-switch path when supported")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20210618143444.587311343@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      d2994844
    • S
      iommu/vt-d: Fix agaw for a supported 48 bit guest address width · 479309c5
      Saeed Mirzamohammadi committed
      stable inclusion
      from linux-4.19.205
      commit 6a9449e9568808930e7d4d83c33e320329113a67
      
      --------------------------------
      
      [ Upstream commit 327d5b2f ]
      
      The IOMMU driver calculates the guest addressability for a DMA request
      based on the value of the mgaw reported from the IOMMU. However, this
      is a fused value and as mentioned in the spec, the guest width
      should be calculated based on the minimum of supported adjusted guest
      address width (SAGAW) and MGAW.
      
      This is from specification:
      "Guest addressability for a given DMA request is limited to the
      minimum of the value reported through this field and the adjusted
      guest address width of the corresponding page-table structure.
      (Adjusted guest address widths supported by hardware are reported
      through the SAGAW field)."
      
      This causes domain initialization to fail and following
      errors appear for EHCI PCI driver:
      
      [    2.486393] ehci-pci 0000:01:00.4: EHCI Host Controller
      [    2.486624] ehci-pci 0000:01:00.4: new USB bus registered, assigned bus number 1
      [    2.489127] ehci-pci 0000:01:00.4: DMAR: Allocating domain failed
      [    2.489350] ehci-pci 0000:01:00.4: DMAR: 32bit DMA uses non-identity mapping
      [    2.489359] ehci-pci 0000:01:00.4: can't setup: -12
      [    2.489531] ehci-pci 0000:01:00.4: USB bus 1 deregistered
      [    2.490023] ehci-pci 0000:01:00.4: init 0000:01:00.4 fail, -12
      [    2.490358] ehci-pci: probe of 0000:01:00.4 failed with error -12
      
      This issue happens when the value of the sagaw corresponds to a
      48-bit agaw. This fix updates the calculation of the agaw based on
      the minimum of IOMMU's sagaw value and MGAW.
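
The min(SAGAW, MGAW) rule described above can be sketched as a small standalone model. This is not the driver code: `effective_guest_width()` is a hypothetical helper, and the width formula (30 bits plus 9 bits per page-table level, so SAGAW bit 2 means 48 bits) is an assumption based on the usual VT-d page-table layout.

```c
#include <assert.h>

#define LEVEL_STRIDE   9   /* address bits resolved per page-table level */
#define MAX_AGAW_WIDTH 64

/* Width in bits covered by an adjusted guest address width index. */
static int agaw_to_width(int agaw)
{
    int width = 30 + agaw * LEVEL_STRIDE;

    return width < MAX_AGAW_WIDTH ? width : MAX_AGAW_WIDTH;
}

/* Smallest agaw index whose width covers 'width' bits (rounds up). */
static int width_to_agaw(int width)
{
    assert(width >= 30);
    return (width - 30 + LEVEL_STRIDE - 1) / LEVEL_STRIDE;
}

/*
 * Hypothetical helper: walk down from the agaw level that covers
 * max_width to the first level present in the SAGAW bitmask, then clamp
 * the guest width to min(MGAW, width of that level).
 * Returns -1 if no supported level is found.
 */
static int effective_guest_width(unsigned int sagaw, int mgaw, int max_width)
{
    int agaw;

    for (agaw = width_to_agaw(max_width); agaw >= 0; agaw--) {
        if (sagaw & (1u << agaw)) {
            int width = agaw_to_width(agaw);

            return mgaw < width ? mgaw : width;
        }
    }
    return -1;
}
```

With only the 48-bit level set in SAGAW and a reported MGAW of 57, the model yields a guest width of 48 rather than 57.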
      
      This issue happens on the code path of getting a private domain for a
      device. A private domain was needed when the domain of an iommu group
      couldn't meet the requirement of a device. The IOMMU core has been
      evolved to eliminate the need for private domain, hence this code path
      has already been removed from the upstream since commit 327d5b2f
      ("iommu/vt-d: Allow 32bit devices to uses DMA domain"). Instead of back
      porting all patches that are required for removing the private domain,
      this simply fixes it in the affected stable kernel between v4.16 and v5.7.
      
      [baolu: The original patch could be found here
       https://lore.kernel.org/linux-iommu/20210412202736.70765-1-saeed.mirzamohammadi@oracle.com/.
       I added commit message according to Greg's comments at
       https://lore.kernel.org/linux-iommu/YHZ%2FT9x7Xjf1r6fI@kroah.com/.]
      
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Ashok Raj <ashok.raj@intel.com>
      Cc: stable@vger.kernel.org #v4.16+
      Signed-off-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
      Tested-by: Camille Lu <camille.lu@hpe.com>
      Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      479309c5
    • T
      PCI/MSI: Enforce MSI[X] entry updates to be visible · 256535c0
      Thomas Gleixner committed
      stable inclusion
      from linux-4.19.205
      commit 153cc7c9dfefe646c8b2a74eb925b6620b915154
      
      --------------------------------
      
      commit b9255a7c upstream.
      
      Nothing enforces the posted writes to be visible when the function
      returns. Flush them even if the flush might be redundant when the entry is
      masked already as the unmask will flush as well. This is either setup or a
      rare affinity change event so the extra flush is not the end of the world.
      
      While this is more of a theoretical issue, the logic in the x86-specific
      msi_set_affinity() function in particular relies on the assumption that the
      update has reached the hardware when the function returns.
      
      Again, as this never has been enforced the Fixes tag refers to a commit in:
         git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
      
      Fixes: f036d4ea5fa7 ("[PATCH] ia32 Message Signalled Interrupt support")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Marc Zyngier <maz@kernel.org>
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20210729222542.515188147@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      256535c0
    • T
      PCI/MSI: Enforce that MSI-X table entry is masked for update · 1cca3beb
      Thomas Gleixner committed
      stable inclusion
      from linux-4.19.205
      commit b590b85fc91979a97cbb4ab1bcf888aa245cd5e3
      
      --------------------------------
      
      commit da181dc9 upstream.
      
      The specification (PCIe r5.0, sec 6.1.4.5) states:
      
          For MSI-X, a function is permitted to cache Address and Data values
          from unmasked MSI-X Table entries. However, anytime software unmasks a
          currently masked MSI-X Table entry either by clearing its Mask bit or
          by clearing the Function Mask bit, the function must update any Address
          or Data values that it cached from that entry. If software changes the
          Address or Data value of an entry while the entry is unmasked, the
          result is undefined.
      
      The Linux kernel's MSI-X support never enforced that the entry is masked
      before the entry is modified hence the Fixes tag refers to a commit in:
            git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
      
      Enforce the entry to be masked across the update.
      
      There is no point in enforcing this to be handled at all possible call
      sites as this is just pointless code duplication and the common update
      function is the obvious place to enforce this.
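
A minimal sketch of "masked across the update" follows, using an in-memory stand-in for one MSI-X table entry. The struct and function names are hypothetical; real code would use MMIO accessors and a read-back to flush posted writes, which this model leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSIX_ENTRY_CTRL_MASKBIT 0x1u  /* bit 0 of the Vector Control word */

/* In-memory stand-in for one 16-byte MSI-X table entry. */
struct msix_entry_model {
    uint32_t addr_lo;
    uint32_t addr_hi;
    uint32_t data;
    uint32_t vector_ctrl;
};

struct msi_msg_model {
    uint32_t addr_lo;
    uint32_t addr_hi;
    uint32_t data;
};

/*
 * Update address/data only while the entry is masked. If the entry was
 * unmasked on entry, mask it first and restore the unmasked state after.
 */
static void msix_write_msg_masked(struct msix_entry_model *e,
                                  const struct msi_msg_model *msg)
{
    bool was_unmasked;

    assert(e && msg);
    was_unmasked = !(e->vector_ctrl & MSIX_ENTRY_CTRL_MASKBIT);

    if (was_unmasked)
        e->vector_ctrl |= MSIX_ENTRY_CTRL_MASKBIT;

    e->addr_lo = msg->addr_lo;
    e->addr_hi = msg->addr_hi;
    e->data = msg->data;

    if (was_unmasked)
        e->vector_ctrl &= ~MSIX_ENTRY_CTRL_MASKBIT;
}
```

Placing the mask/unmask pair inside the single update function means callers cannot forget it, which is the point the commit message makes about call sites.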
      
      Fixes: f036d4ea5fa7 ("[PATCH] ia32 Message Signalled Interrupt support")
      Reported-by: Kevin Tian <kevin.tian@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Marc Zyngier <maz@kernel.org>
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20210729222542.462096385@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1cca3beb
    • T
      PCI/MSI: Mask all unused MSI-X entries · 2de6a011
      Thomas Gleixner committed
      stable inclusion
      from linux-4.19.205
      commit 3b570884c868c12e3184627ce4b4a167e9d6f018
      
      --------------------------------
      
      commit 7d5ec3d3 upstream.
      
      When MSI-X is enabled the ordering of calls is:
      
        msix_map_region();
        msix_setup_entries();
        pci_msi_setup_msi_irqs();
        msix_program_entries();
      
      This has a few interesting issues:
      
       1) msix_setup_entries() allocates the MSI descriptors and initializes them
          except for the msi_desc:masked member which is left zero initialized.
      
       2) pci_msi_setup_msi_irqs() allocates the interrupt descriptors and sets
          up the MSI interrupts which ends up in pci_write_msi_msg() unless the
          interrupt chip provides its own irq_write_msi_msg() function.
      
       3) msix_program_entries() does not do what the name suggests. It solely
          updates the entries array (if not NULL) and initializes the masked
          member for each MSI descriptor by reading the hardware state and then
          masks the entry.
      
      Obviously this has some issues:
      
       1) The uninitialized masked member of msi_desc prevents the enforcement
          of masking the entry in pci_write_msi_msg() depending on the cached
          masked bit. Aside from that, half-initialized data is a no-no in general.
      
       2) msix_program_entries() only ensures that the actually allocated entries
          are masked. This is wrong as experimentation with crash testing and
          crash kernel kexec has shown.
      
          This limited testing unearthed that when the production kernel had more
          entries in use and unmasked when it crashed and the crash kernel
          allocated a smaller amount of entries, then a full scan of all entries
          found unmasked entries which were in use in the production kernel.
      
          This is obviously a device or emulation issue as the device reset
          should mask all MSI-X table entries, but obviously that's just part
          of the paper specification.
      
      Cure this by:
      
       1) Masking all table entries in hardware
       2) Initializing msi_desc::masked in msix_setup_entries()
       3) Removing the mask dance in msix_program_entries()
       4) Renaming msix_program_entries() to msix_update_entries() to
          reflect the purpose of that function.
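
Step 1 of the cure, masking every table entry regardless of how many the kernel allocates, can be sketched over a modeled table. The names below are hypothetical and the table is plain memory rather than the mapped MSI-X region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MSIX_ENTRY_CTRL_MASKBIT 0x1u  /* bit 0 of the Vector Control word */

/* In-memory stand-in for one MSI-X table entry. */
struct msix_entry_model {
    uint32_t addr_lo;
    uint32_t addr_hi;
    uint32_t data;
    uint32_t vector_ctrl;
};

/* Set the mask bit in every table entry, allocated or not. */
static void msix_mask_all_model(struct msix_entry_model *table, size_t tsize)
{
    size_t i;

    assert(table || tsize == 0);
    for (i = 0; i < tsize; i++)
        table[i].vector_ctrl |= MSIX_ENTRY_CTRL_MASKBIT;
}
```

The loop bound is the hardware table size, not the number of vectors requested, so entries left unmasked by a previous kernel are covered as well.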
      
      As the masking of unused entries has never been done the Fixes tag refers
      to a commit in:
         git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
      
      Fixes: f036d4ea5fa7 ("[PATCH] ia32 Message Signalled Interrupt support")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Marc Zyngier <maz@kernel.org>
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20210729222542.403833459@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2de6a011
    • T
      PCI/MSI: Protect msi_desc::masked for multi-MSI · a2734433
      Thomas Gleixner committed
      stable inclusion
      from linux-4.19.205
      commit 3c9534778d4cc2bd01e20d4dcffc55df0962aa12
      
      --------------------------------
      
      commit 77e89afc upstream.
      
      Multi-MSI uses a single MSI descriptor and there is a single mask register
      when the device supports per vector masking. To avoid reading back the mask
      register the value is cached in the MSI descriptor and updates are done by
      clearing and setting bits in the cache and writing it to the device.
      
      But nothing protects msi_desc::masked and the mask register from being
      modified concurrently on two different CPUs for two different Linux
      interrupts which belong to the same multi-MSI descriptor.
      
      Add a lock to struct device and protect any operation on the mask and the
      mask register with it.
      
      This makes the update of msi_desc::masked unconditional, but there is no
      place which requires a modification of the hardware register without
      updating the masked cache.
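
The locking idea can be illustrated in user space. In this sketch a pthread mutex stands in for the per-device lock (the kernel would use its own lock type), and `msi_dev_model` and `msi_update_mask_locked()` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

struct msi_dev_model {
    pthread_mutex_t lock;   /* stands in for the new per-device lock */
    uint32_t masked;        /* cached copy of the mask register */
    uint32_t mask_reg;      /* modeled hardware mask register */
};

/*
 * Clear and set bits in the cached mask and write the result to the
 * modeled register, all under one lock, so two CPUs updating different
 * vectors of the same multi-MSI descriptor cannot lose each other's bits.
 */
static void msi_update_mask_locked(struct msi_dev_model *dev,
                                   uint32_t clear, uint32_t set)
{
    assert(dev);
    pthread_mutex_lock(&dev->lock);
    dev->masked &= ~clear;
    dev->masked |= set;
    dev->mask_reg = dev->masked;   /* cache and register change together */
    pthread_mutex_unlock(&dev->lock);
}
```

Because the cache is updated on every call, the register and `masked` can never disagree, which matches the "unconditional update" remark above.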
      
      msi_mask_irq() is now an empty wrapper which will be cleaned up in follow
      up changes.
      
      The problem goes way back to the initial support of multi-MSI, but picking
      the commit which introduced the mask cache is a valid cut off point
      (2.6.30).
      
      Fixes: f2440d9a ("PCI MSI: Refactor interrupt masking code")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Marc Zyngier <maz@kernel.org>
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20210729222542.726833414@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a2734433
    • T
      PCI/MSI: Use msi_mask_irq() in pci_msi_shutdown() · e6482181
      Thomas Gleixner committed
      stable inclusion
      from linux-4.19.205
      commit 1b36c30a9335db941423c05b49a8266a84a82f95
      
      --------------------------------
      
      commit d28d4ad2 upstream.
      
      No point in using the raw write function from shutdown. Preparatory change
      to introduce proper serialization for the msi_desc::masked cache.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Marc Zyngier <maz@kernel.org>
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20210729222542.674391354@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      e6482181
    • T
      PCI/MSI: Correct misleading comments · e521a20d
      Thomas Gleixner committed
      stable inclusion
      from linux-4.19.205
      commit c5b223cd04706589e5e6840e2ab7c4f879323ed9
      
      --------------------------------
      
      commit 689e6b53 upstream.
      
      The comments about preserving the cached state in pci_msi[x]_shutdown() are
      misleading as the MSI descriptors are freed right after those functions
      return. So there is nothing to restore. Preparatory change.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Marc Zyngier <maz@kernel.org>
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20210729222542.621609423@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      e521a20d
    • T
      PCI/MSI: Do not set invalid bits in MSI mask · d7692543
      Thomas Gleixner committed
      stable inclusion
      from linux-4.19.205
      commit 22f4a36d086d74f7abe9c4eaf65204048cd84f9c
      
      --------------------------------
      
      commit 361fd373 upstream.
      
      msi_mask_irq() takes a mask and a flags argument. The mask argument is used
      to mask out bits from the cached mask and the flags argument to set bits.
      
      Some places invoke it with a flags argument which sets bits which are not
      used by the device, i.e. when the device supports up to 8 vectors a full
      unmask in some places sets the mask to 0xFFFFFF00. While devices probably
      do not care, it's still bad practice.
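
The arithmetic behind the 0xFFFFFF00 example can be checked with a short model. This sketch clamps the bits being set to the device's valid range; that is one defensive option chosen for illustration and not necessarily how the patch resolves it. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Valid mask bits for a device supporting 2^log2_nvec vectors. */
static uint32_t msi_valid_mask(unsigned int log2_nvec)
{
    assert(log2_nvec <= 5);
    /* Avoid shifting a 32-bit value by 32, which is undefined in C. */
    if (log2_nvec == 5)
        return 0xffffffffu;
    return (1u << (1u << log2_nvec)) - 1;
}

/* Clear 'clear' bits, then set 'set' bits limited to the valid range. */
static uint32_t msi_update_mask(uint32_t cached, uint32_t clear, uint32_t set,
                                unsigned int log2_nvec)
{
    uint32_t valid = msi_valid_mask(log2_nvec);

    cached &= ~clear;
    cached |= set & valid;
    return cached;
}
```

For an 8-vector device (`log2_nvec` of 3) the valid mask is 0xFF, so a call that clears 0xFF and tries to set the complement ends with a mask of 0 instead of 0xFFFFFF00.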
      
      Fixes: 7ba1930d ("PCI MSI: Unmask MSI if setup failed")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Marc Zyngier <maz@kernel.org>
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20210729222542.568173099@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      d7692543
    • T
      PCI/MSI: Enable and mask MSI-X early · 006d934a
      Thomas Gleixner committed
      stable inclusion
      from linux-4.19.205
      commit 6aea847496c8c9a37a5df795c4fe42a0e5fcccc5
      
      --------------------------------
      
      commit 43855395 upstream.
      
      The ordering of MSI-X enable in hardware is dysfunctional:
      
       1) MSI-X is disabled in the control register
       2) Various setup functions
       3) pci_msi_setup_msi_irqs() is invoked which ends up accessing
          the MSI-X table entries
       4) MSI-X is enabled and masked in the control register with the
          comment that enabling is required for some hardware to access
          the MSI-X table
      
      Step #4 obviously contradicts #3. The history of this is an issue with the
      NIU hardware. When #4 was introduced the table access actually happened in
      msix_program_entries() which was invoked after enabling and masking MSI-X.
      
      This was changed in commit d71d6432 ("PCI/MSI: Kill redundant call of
      irq_set_msi_desc() for MSI-X interrupts") which removed the table write
      from msix_program_entries().
      
      Interestingly enough nobody noticed and either NIU still works or it did
      not get any testing with a kernel 3.19 or later.
      
      Nevertheless this is inconsistent and there is no reason why MSI-X can't be
      enabled and masked in the control register early on, i.e. move step #4
      above to step #1. This preserves the NIU workaround and has no side effects
      on other hardware.
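
The reordered sequence boils down to two updates of the MSI-X control word: enable with the function mask set before any table access, and drop the function mask once setup is done. The sketch below models the control word as a plain integer; the helper names are hypothetical, and the two bit values are the standard MSI-X Message Control bits (Enable is bit 15, Function Mask is bit 14).

```c
#include <assert.h>
#include <stdint.h>

#define PCI_MSIX_FLAGS_MASKALL 0x4000u  /* Function Mask */
#define PCI_MSIX_FLAGS_ENABLE  0x8000u  /* MSI-X Enable */

/* Clear 'clear' bits and set 'set' bits in a modeled MSI-X control word. */
static uint16_t msix_ctrl_update(uint16_t ctrl, uint16_t clear, uint16_t set)
{
    assert((clear & set) == 0);
    ctrl &= (uint16_t)~clear;
    ctrl |= set;
    return ctrl;
}

/* First step of the reordered sequence: enable with all vectors masked. */
static uint16_t msix_enable_masked(uint16_t ctrl)
{
    return msix_ctrl_update(ctrl, 0,
                            PCI_MSIX_FLAGS_MASKALL | PCI_MSIX_FLAGS_ENABLE);
}

/* Last step, after the table has been programmed: drop the function mask. */
static uint16_t msix_unmask_function(uint16_t ctrl)
{
    return msix_ctrl_update(ctrl, PCI_MSIX_FLAGS_MASKALL, 0);
}
```

Everything that touches the MSI-X table would run between the two calls, so hardware that needs MSI-X enabled for table access sees it enabled throughout.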
      
      Fixes: d71d6432 ("PCI/MSI: Kill redundant call of irq_set_msi_desc() for MSI-X interrupts")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Marc Zyngier <maz@kernel.org>
      Reviewed-by: Ashok Raj <ashok.raj@intel.com>
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20210729222542.344136412@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      006d934a
    • B
      genirq/msi: Ensure deactivation on teardown · eaabd1aa
      Bixuan Cui committed
      stable inclusion
      from linux-4.19.205
      commit 504a4c1057151a1f1332fb3ce940134db8d6b885
      
      --------------------------------
      
      commit dbbc9357 upstream.
      
      msi_domain_alloc_irqs() invokes irq_domain_activate_irq(), but
      msi_domain_free_irqs() does not enforce deactivation before tearing down
      the interrupts.
      
      This happens when PCI/MSI interrupts are set up and never used before being
      torn down again, e.g. in error handling paths. The only place which cleans
      that up is the error handling path in msi_domain_alloc_irqs().
      
      Move the cleanup from msi_domain_alloc_irqs() into msi_domain_free_irqs()
      to cure that.
      
      Fixes: f3b0946d ("genirq/msi: Make sure PCI MSIs are activated early")
      Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20210518033117.78104-1-cuibixuan@huawei.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      eaabd1aa
    • T
      x86/ioapic: Force affinity setup before startup · 5c126be8
      Thomas Gleixner committed
      stable inclusion
      from linux-4.19.205
      commit 697658a61db4f3aa213d76336ccf30e66e6c44ca
      
      --------------------------------
      
      commit 0c0e37dc upstream.
      
      The IO/APIC cannot handle interrupt affinity changes safely after startup
      other than from an interrupt handler. The startup sequence in the generic
      interrupt code violates that assumption.
      
      Mark the irq chip with the new IRQCHIP_AFFINITY_PRE_STARTUP flag so that
      the default interrupt setting happens before the interrupt is started up
      for the first time.
      
      Fixes: 18404756 ("genirq: Expose default irq affinity mask (take 3)")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Marc Zyngier <maz@kernel.org>
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20210729222542.832143400@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      5c126be8
    • T
      x86/msi: Force affinity setup before startup · 297ac420
      Thomas Gleixner committed
      stable inclusion
      from linux-4.19.205
      commit 354b210062b1e50ef284f97590011c2231316eaa
      
      --------------------------------
      
      commit ff363f48 upstream.
      
      The X86 MSI mechanism cannot handle interrupt affinity changes safely after
      startup other than from an interrupt handler, unless interrupt remapping is
      enabled. The startup sequence in the generic interrupt code violates that
      assumption.
      
      Mark the irq chips with the new IRQCHIP_AFFINITY_PRE_STARTUP flag so that
      the default interrupt setting happens before the interrupt is started up
      for the first time.
      
      While the interrupt remapping MSI chip does not require this, there is no
      point in treating it differently as this might spare an interrupt to a CPU
      which is not in the default affinity mask.
      
      For the non-remapping case go to the direct write path when the interrupt
      is not yet started similar to the not yet activated case.
      
      Fixes: 18404756 ("genirq: Expose default irq affinity mask (take 3)")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Marc Zyngier <maz@kernel.org>
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20210729222542.886722080@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      297ac420
    • T
      genirq: Provide IRQCHIP_AFFINITY_PRE_STARTUP · e9b57276
      Thomas Gleixner committed
      stable inclusion
      from linux-4.19.205
      commit cab824f67d7e8f68288d615929dec02607e473ad
      
      --------------------------------
      
      commit 826da771 upstream.
      
      X86 IO/APIC and MSI interrupts (when used without interrupts remapping)
      require that the affinity setup on startup is done before the interrupt is
      enabled for the first time as the non-remapped operation mode cannot safely
      migrate enabled interrupts from arbitrary contexts. Provide a new irq chip
      flag which allows affected hardware to request this.
      
      This has to be opt-in because there have been reports in the past that some
      interrupt chips cannot handle affinity setting before startup.
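
The effect of the opt-in flag on ordering can be shown with a toy model that records the sequence of steps. Everything here is illustrative: `irq_model` and `irq_startup_model()` are hypothetical, and the bit position chosen for the flag is an assumption rather than the value used in the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Bit position is illustrative only. */
#define IRQCHIP_AFFINITY_PRE_STARTUP (1u << 10)

struct irq_model {
    unsigned int chip_flags;
    char trace[32];   /* records the order of operations */
};

static void trace_add(struct irq_model *irq, const char *step)
{
    assert(strlen(irq->trace) + strlen(step) < sizeof(irq->trace));
    strcat(irq->trace, step);
}

/*
 * Chips that set the flag get their default affinity applied before the
 * interrupt is started; all other chips keep the old order.
 */
static void irq_startup_model(struct irq_model *irq)
{
    bool pre = (irq->chip_flags & IRQCHIP_AFFINITY_PRE_STARTUP) != 0;

    if (pre)
        trace_add(irq, "affinity;");
    trace_add(irq, "startup;");
    if (!pre)
        trace_add(irq, "affinity;");
}
```

Keeping the default branch unchanged is what makes the flag opt-in: chips that cannot handle affinity setting before startup are unaffected.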
      
      Fixes: 18404756 ("genirq: Expose default irq affinity mask (take 3)")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Marc Zyngier <maz@kernel.org>
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20210729222542.779791738@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Conflicts:
        include/linux/irq.h
      [yyl: adjust context]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      e9b57276
    • N
      tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B packets · 808541e9
      Neal Cardwell committed
      stable inclusion
      from linux-4.19.205
      commit 32b6627fec712fb75fbed272517c74814c00ccfc
      
      --------------------------------
      
      [ Upstream commit 6de035fe ]
      
      Currently if BBR congestion control is initialized after more than 2B
      packets have been delivered, depending on the phase of the
      tp->delivered counter the tracking of BBR round trips can get stuck.
      
      The bug arises because if tp->delivered is between 2^31 and 2^32 at
      the time the BBR congestion control module is initialized, then the
      initialization of bbr->next_rtt_delivered to 0 will cause the logic to
      believe that the end of the round trip is still billions of packets in
      the future. More specifically, the following check will fail
      repeatedly:
      
        !before(rs->prior_delivered, bbr->next_rtt_delivered)
      
      and thus the connection will take up to 2B packets delivered before
      that check will pass and the connection will set:
      
        bbr->round_start = 1;
      
      This could cause many mechanisms in BBR to fail to trigger, for
      example bbr_check_full_bw_reached() would likely never exit STARTUP.
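
The wrap behaviour can be reproduced with plain 32-bit arithmetic. The sketch below assumes `before()` is the usual wrap-safe signed-difference comparison; `round_starts()` and `packets_until_round_start()` are hypothetical helpers written only to exercise the check quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Wrap-safe "a is before b" for 32-bit counters. Relies on the common
 * two's complement conversion of uint32_t to int32_t.
 */
static bool before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* True when the round-trip boundary check from the text passes. */
static bool round_starts(uint32_t prior_delivered, uint32_t next_rtt_delivered)
{
    return !before(prior_delivered, next_rtt_delivered);
}

/* Packets that must still be delivered before the check can pass. */
static uint32_t packets_until_round_start(uint32_t delivered, uint32_t next)
{
    uint32_t remaining;

    if (round_starts(delivered, next))
        return 0;
    remaining = next - delivered;        /* computed modulo 2^32 */
    assert(remaining <= 0x80000000u);    /* never more than 2^31 packets */
    return remaining;
}
```

With a delivered count of 3,000,000,000 and a boundary of 0 the check fails and roughly 1.29 billion more packets are needed, whereas a boundary seeded with the current delivered count passes immediately.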
      
      This bug is 5 years old and has not been observed, and as a practical
      matter this would likely rarely trigger, since it would require
      transferring at least 2B packets, or likely more than 3 terabytes of
      data, before switching congestion control algorithms to BBR.
      
      This patch is a stable candidate for kernels as far back as v4.9,
      when tcp_bbr.c was added.
      
      Fixes: 0f8782ea ("tcp_bbr: add BBR congestion control")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Yuchung Cheng <ycheng@google.com>
      Reviewed-by: Kevin Yang <yyd@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20210811024056.235161-1-ncardwell@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      808541e9
    • Y
      net: bridge: fix memleak in br_add_if() · c9ea3933
      Yang Yingliang committed
      stable inclusion
      from linux-4.19.205
      commit f41237f60cb0202827432706c33faba3adadbfb5
      
      --------------------------------
      
      [ Upstream commit 519133de ]
      
      I got a memleak report:
      
      BUG: memory leak
      unreferenced object 0x607ee521a658 (size 240):
      comm "syz-executor.0", pid 955, jiffies 4294780569 (age 16.449s)
      hex dump (first 32 bytes, cpu 1):
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      backtrace:
      [<00000000d830ea5a>] br_multicast_add_port+0x1c2/0x300 net/bridge/br_multicast.c:1693
      [<00000000274d9a71>] new_nbp net/bridge/br_if.c:435 [inline]
      [<00000000274d9a71>] br_add_if+0x670/0x1740 net/bridge/br_if.c:611
      [<0000000012ce888e>] do_set_master net/core/rtnetlink.c:2513 [inline]
      [<0000000012ce888e>] do_set_master+0x1aa/0x210 net/core/rtnetlink.c:2487
      [<0000000099d1cafc>] __rtnl_newlink+0x1095/0x13e0 net/core/rtnetlink.c:3457
      [<00000000a01facc0>] rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3488
      [<00000000acc9186c>] rtnetlink_rcv_msg+0x369/0xa10 net/core/rtnetlink.c:5550
      [<00000000d4aabb9c>] netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2504
      [<00000000bc2e12a3>] netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
      [<00000000bc2e12a3>] netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1340
      [<00000000e4dc2d0e>] netlink_sendmsg+0x789/0xc70 net/netlink/af_netlink.c:1929
      [<000000000d22c8b3>] sock_sendmsg_nosec net/socket.c:654 [inline]
      [<000000000d22c8b3>] sock_sendmsg+0x139/0x170 net/socket.c:674
      [<00000000e281417a>] ____sys_sendmsg+0x658/0x7d0 net/socket.c:2350
      [<00000000237aa2ab>] ___sys_sendmsg+0xf8/0x170 net/socket.c:2404
      [<000000004f2dc381>] __sys_sendmsg+0xd3/0x190 net/socket.c:2433
      [<0000000005feca6c>] do_syscall_64+0x37/0x90 arch/x86/entry/common.c:47
      [<000000007304477d>] entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      On the error path of br_add_if(), p->mcast_stats allocated in
      new_nbp() needs to be freed, or it will be leaked.
      
      Fixes: 1080ab95 ("net: bridge: add support for IGMP/MLD stats and export them via netlink")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Link: https://lore.kernel.org/r/20210809132023.978546-1-yangyingliang@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      c9ea3933
    • E
      net: igmp: fix data-race in igmp_ifc_timer_expire() · 43253564
      Eric Dumazet committed
      stable inclusion
      from linux-4.19.205
      commit fb5db3106036f4e21a63c0c6b08db4b4f18f157c
      
      --------------------------------
      
      [ Upstream commit 4a2b285e ]
      
      Fix the data-race reported by syzbot [1]
      Issue here is that igmp_ifc_timer_expire() can update in_dev->mr_ifc_count
      while another change just occurred from another context.
      
      in_dev->mr_ifc_count is only 8bit wide, so the race had little
      consequences.
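
This excerpt does not show the patch itself. One common way to close a race of this shape is a compare-and-swap retry loop that decrements the counter only if it is still non-zero. The sketch below uses C11 atomics in user space as a stand-in, and `dec_if_nonzero()` is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Decrement the counter only if it is non-zero, retrying when another
 * context changed it in between. Returns true if a decrement happened.
 */
static bool dec_if_nonzero(_Atomic unsigned char *count)
{
    unsigned char old;

    assert(count);
    old = atomic_load(count);
    while (old != 0) {
        /* On failure 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(count, &old,
                                         (unsigned char)(old - 1)))
            return true;
    }
    return false;
}
```

A concurrent writer that bumps the counter between the load and the exchange makes the exchange fail, so the loop retries with the fresh value instead of overwriting it.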
      
      [1]
      BUG: KCSAN: data-race in igmp_ifc_event / igmp_ifc_timer_expire
      
      write to 0xffff8881051e3062 of 1 bytes by task 12547 on cpu 0:
       igmp_ifc_event+0x1d5/0x290 net/ipv4/igmp.c:821
       igmp_group_added+0x462/0x490 net/ipv4/igmp.c:1356
       ____ip_mc_inc_group+0x3ff/0x500 net/ipv4/igmp.c:1461
       __ip_mc_join_group+0x24d/0x2c0 net/ipv4/igmp.c:2199
       ip_mc_join_group_ssm+0x20/0x30 net/ipv4/igmp.c:2218
       do_ip_setsockopt net/ipv4/ip_sockglue.c:1285 [inline]
       ip_setsockopt+0x1827/0x2a80 net/ipv4/ip_sockglue.c:1423
       tcp_setsockopt+0x8c/0xa0 net/ipv4/tcp.c:3657
       sock_common_setsockopt+0x5d/0x70 net/core/sock.c:3362
       __sys_setsockopt+0x18f/0x200 net/socket.c:2159
       __do_sys_setsockopt net/socket.c:2170 [inline]
       __se_sys_setsockopt net/socket.c:2167 [inline]
       __x64_sys_setsockopt+0x62/0x70 net/socket.c:2167
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff8881051e3062 of 1 bytes by interrupt on cpu 1:
       igmp_ifc_timer_expire+0x706/0xa30 net/ipv4/igmp.c:808
       call_timer_fn+0x2e/0x1d0 kernel/time/timer.c:1419
       expire_timers+0x135/0x250 kernel/time/timer.c:1464
       __run_timers+0x358/0x420 kernel/time/timer.c:1732
       run_timer_softirq+0x19/0x30 kernel/time/timer.c:1745
       __do_softirq+0x12c/0x26e kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu+0x9a/0xb0 kernel/softirq.c:636
       sysvec_apic_timer_interrupt+0x69/0x80 arch/x86/kernel/apic/apic.c:1100
       asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638
       console_unlock+0x8e8/0xb30 kernel/printk/printk.c:2646
       vprintk_emit+0x125/0x3d0 kernel/printk/printk.c:2174
       vprintk_default+0x22/0x30 kernel/printk/printk.c:2185
       vprintk+0x15a/0x170 kernel/printk/printk_safe.c:392
       printk+0x62/0x87 kernel/printk/printk.c:2216
       selinux_netlink_send+0x399/0x400 security/selinux/hooks.c:6041
       security_netlink_send+0x42/0x90 security/security.c:2070
       netlink_sendmsg+0x59e/0x7c0 net/netlink/af_netlink.c:1919
       sock_sendmsg_nosec net/socket.c:703 [inline]
       sock_sendmsg net/socket.c:723 [inline]
       ____sys_sendmsg+0x360/0x4d0 net/socket.c:2392
       ___sys_sendmsg net/socket.c:2446 [inline]
       __sys_sendmsg+0x1ed/0x270 net/socket.c:2475
       __do_sys_sendmsg net/socket.c:2484 [inline]
       __se_sys_sendmsg net/socket.c:2482 [inline]
       __x64_sys_sendmsg+0x42/0x50 net/socket.c:2482
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x01 -> 0x02
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 12539 Comm: syz-executor.1 Not tainted 5.14.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      43253564
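For the igmp_ifc_timer_expire() race described above, one common way to make a "decrement if non-zero" update safe against a concurrent writer is a compare-and-exchange retry loop. The following is a minimal user-space sketch using C11 atomics, not the kernel patch itself; `ifc_count`, `ifc_event()` and `ifc_timer_expire()` are hypothetical stand-ins for the kernel field and functions.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for in_dev->mr_ifc_count (8 bits wide). */
static _Atomic unsigned char ifc_count;

/* Writer side: publish a fresh retransmit count from another context. */
static void ifc_event(unsigned char count)
{
    atomic_store(&ifc_count, count);
}

/*
 * Timer side: decrement only if the value is still the one we read.
 * Returns true when a decrement happened (the timer would be re-armed).
 */
static bool ifc_timer_expire(void)
{
    unsigned char seen = atomic_load(&ifc_count);

    while (seen != 0) {
        /* On failure, 'seen' is refreshed with the current value; retry. */
        if (atomic_compare_exchange_weak(&ifc_count, &seen,
                                         (unsigned char)(seen - 1)))
            return true;
    }
    return false;
}
```

With this shape, a store from `ifc_event()` that lands between the load and the exchange is never overwritten by a stale decrement.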
    • D
      ACPI: NFIT: Fix support for virtual SPA ranges · 1231368b
      Dan Williams authored
      stable inclusion
      from linux-4.19.205
      commit c39e22fd3f7ce3af64140f560ea63b0c986a46db
      
      --------------------------------
      
      commit b93dfa6b upstream.
      
      Fix the NFIT parsing code to treat a 0 index in a SPA Range Structure as
      a special case and not match Region Mapping Structures that use 0 to
      indicate that they are not mapped. Without this fix some platform BIOS
      descriptions of "virtual disk" ranges do not result in the pmem driver
      attaching to the range.
      
      Details:
      In addition to typical persistent memory ranges, the ACPI NFIT may also
      convey "virtual" ranges. These ranges are indicated by a UUID in the SPA
      Range Structure of UUID_VOLATILE_VIRTUAL_DISK, UUID_VOLATILE_VIRTUAL_CD,
      UUID_PERSISTENT_VIRTUAL_DISK, or UUID_PERSISTENT_VIRTUAL_CD. The
      critical difference between virtual ranges and UUID_PERSISTENT_MEMORY
      is that virtual ranges do not support associations with Region Mapping
      Structures.  For this reason the "index" value of virtual SPA Range
      Structures is allowed to be 0. If a platform BIOS decides to represent
      NVDIMMs with disconnected "Region Mapping Structures" (range-index ==
      0), the kernel may falsely associate them with standalone ranges where
      the "SPA Range Structure Index" is also zero. When this happens the
      driver may falsely require labels where "virtual disks" are expected to
      be label-less. I.e. "label-less" is where the namespace-range ==
      region-range and the pmem driver attaches with no user action to create
      a namespace.
      
      Cc: Jacek Zloch <jacek.zloch@intel.com>
      Cc: Lukasz Sobieraj <lukasz.sobieraj@intel.com>
      Cc: "Lee, Chun-Yi" <jlee@suse.com>
      Cc: <stable@vger.kernel.org>
      Fixes: c2f32acd ("acpi, nfit: treat virtual ramdisk SPA as pmem region")
      Reported-by: Krzysztof Rusocki <krzysztof.rusocki@intel.com>
      Reported-by: Damian Bassa <damian.bassa@intel.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Link: https://lore.kernel.org/r/162870796589.2521182.1240403310175570220.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1231368b
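The matching rule described in the NFIT commit above (index 0 means "not mapped" and must never pair two structures) can be sketched as follows. This is an illustrative sketch only; `struct spa_range`, `struct region_map` and `region_maps_to_spa()` are hypothetical, simplified names and not the driver's actual data structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified views of the two NFIT structures. */
struct spa_range  { uint16_t range_index; };
struct region_map { uint16_t range_index; };

/*
 * A Region Mapping Structure belongs to a SPA range only when both carry
 * the same non-zero index. Index 0 means "no association" on either side,
 * so two zeros must not be treated as a match.
 */
static bool region_maps_to_spa(const struct region_map *map,
                               const struct spa_range *spa)
{
    if (spa->range_index == 0 || map->range_index == 0)
        return false;
    return map->range_index == spa->range_index;
}
```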
    • M
      ovl: prevent private clone if bind mount is not allowed · 591c8183
      Miklos Szeredi authored
      stable inclusion
      from linux-4.19.204
      commit 963d85d630dabe75a3cfde44a006fec3304d07b8
      
      --------------------------------
      
      commit 427215d8 upstream.
      
      Add the following checks from __do_loopback() to clone_private_mount() as
      well:
      
       - verify that the mount is in the current namespace
      
       - verify that there are no locked children
      
      Reported-by: Alois Wohlschlager <alois1@gmx-topmail.de>
      Fixes: c771d683 ("vfs: introduce clone_private_mount()")
      Cc: <stable@vger.kernel.org> # v3.18
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      591c8183
    • M
      tracing: Reject string operand in the histogram expression · cb0c4fa9
      Masami Hiramatsu authored
      stable inclusion
      from linux-4.19.204
      commit 7c165d58effc19fdf68196d4ceebf940d5da777d
      
      --------------------------------
      
      commit a9d10ca4 upstream.
      
      Since the string type cannot be the target of an addition / subtraction
      operation, it must be rejected. Without this fix, the string type is
      silently converted to digits.
      
      Link: https://lkml.kernel.org/r/162742654278.290973.1523000673366456634.stgit@devnote2
      
      Cc: stable@vger.kernel.org
      Fixes: 100719dc ("tracing: Add simple expression support to hist triggers")
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      cb0c4fa9
    • Y
      reiserfs: add check for root_inode in reiserfs_fill_super · 26d2435f
      Yu Kuai authored
      stable inclusion
      from linux-4.19.203
      commit df2f583b63637f9f882ba604cf23e0336de82220
      
      --------------------------------
      
      [ Upstream commit 2acf15b9 ]
      
      Our syzkaller reported a NULL pointer dereference:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000000
      PGD 116e95067 P4D 116e95067 PUD 1080b5067 PMD 0
      Oops: 0010 [#1] SMP KASAN
      CPU: 7 PID: 592 Comm: a.out Not tainted 5.13.0-next-20210629-dirty #67
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-p4
      RIP: 0010:0x0
      Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
      RSP: 0018:ffff888114e779b8 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: 1ffff110229cef39 RCX: ffffffffaa67e1aa
      RDX: 0000000000000000 RSI: ffff88810a58ee00 RDI: ffff8881233180b0
      RBP: ffffffffac38e9c0 R08: ffffffffaa67e17e R09: 0000000000000001
      R10: ffffffffb91c5557 R11: fffffbfff7238aaa R12: ffff88810a58ee00
      R13: ffff888114e77aa0 R14: 0000000000000000 R15: ffff8881233180b0
      FS:  00007f946163c480(0000) GS:ffff88839f1c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffffffffffffffd6 CR3: 00000001099c1000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       __lookup_slow+0x116/0x2d0
       ? page_put_link+0x120/0x120
       ? __d_lookup+0xfc/0x320
       ? d_lookup+0x49/0x90
       lookup_one_len+0x13c/0x170
       ? __lookup_slow+0x2d0/0x2d0
       ? reiserfs_schedule_old_flush+0x31/0x130
       reiserfs_lookup_privroot+0x64/0x150
       reiserfs_fill_super+0x158c/0x1b90
       ? finish_unfinished+0xb10/0xb10
       ? bprintf+0xe0/0xe0
       ? __mutex_lock_slowpath+0x30/0x30
       ? __kasan_check_write+0x20/0x30
       ? up_write+0x51/0xb0
       ? set_blocksize+0x9f/0x1f0
       mount_bdev+0x27c/0x2d0
       ? finish_unfinished+0xb10/0xb10
       ? reiserfs_kill_sb+0x120/0x120
       get_super_block+0x19/0x30
       legacy_get_tree+0x76/0xf0
       vfs_get_tree+0x49/0x160
       ? capable+0x1d/0x30
       path_mount+0xacc/0x1380
       ? putname+0x97/0xd0
       ? finish_automount+0x450/0x450
       ? kmem_cache_free+0xf8/0x5a0
       ? putname+0x97/0xd0
       do_mount+0xe2/0x110
       ? path_mount+0x1380/0x1380
       ? copy_mount_options+0x69/0x140
       __x64_sys_mount+0xf0/0x190
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      This is because 'root_inode' is initialized with the wrong mode, and
      its i_op is set to 'reiserfs_special_inode_operations'. Thus add a
      check for 'root_inode' to fix the problem.
      
      Link: https://lore.kernel.org/r/20210702040743.1918552-1-yukuai3@huawei.com
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      26d2435f
    • M
      serial: 8250: Mask out floating 16/32-bit bus bits · bcc84b74
      Maciej W. Rozycki authored
      stable inclusion
      from linux-4.19.203
      commit 2c39c32f92084736bc871c1ef096602eb1cc7b5b
      
      --------------------------------
      
      commit e5227c51 upstream.
      
      Make sure only actual 8 bits of the IIR register are used in determining
      the port type in `autoconfig'.
      
      The `serial_in' port accessor returns the `unsigned int' type, meaning
      that with UPIO_AU, UPIO_MEM16, UPIO_MEM32, and UPIO_MEM32BE access types
      more than 8 bits of data are returned, of which the high order bits will
      often come from bus lines that are left floating in the data phase.  For
      example with the MIPS Malta board's CBUS UART, where the registers are
      aligned on 8-byte boundaries and which uses 32-bit accesses, data as
      follows is returned:
      
      YAMON> dump -32 0xbf000900 0x40
      
      BF000900: 1F000942 1F000942 1F000900 1F000900  ...B...B........
      BF000910: 1F000901 1F000901 1F000900 1F000900  ................
      BF000920: 1F000900 1F000900 1F000960 1F000960  ...........`...`
      BF000930: 1F000900 1F000900 1F0009FF 1F0009FF  ................
      
      YAMON>
      
      Evidently high-order 24 bits return values previously driven in the
      address phase (the 3 highest order address bits used with the command
      above are masked out in the simple virtual address mapping used here and
      come out at zeros on the external bus), a common scenario with bus lines
      left floating, due to bus capacitance.
      
      Consequently when the value of IIR, mapped at 0x1f000910, is retrieved
      in `autoconfig', it comes out at 0x1f0009c1 and when it is right-shifted
      by 6 and then assigned to 8-bit `scratch' variable, the value calculated
      is 0x27, not one of 0, 1, 2, 3 expected in port type determination.
      
      Fix the issue then, by assigning the value returned from `serial_in' to
      `scratch' first, which masks out 24 high-order bits retrieved, and only
      then right-shift the resulting 8-bit data quantity, producing the value
      of 3 in this case, as expected.  Fix the same issue in `serial_dl_read'.
      
      The problem first appeared with Linux 2.6.9-rc3 which predates our repo
      history, but the origin could be identified with the old MIPS/Linux repo
      also at: <git://git.kernel.org/pub/scm/linux/kernel/git/ralf/linux.git>
      as commit e0d2356c0777 ("Merge with Linux 2.6.9-rc3."), where code in
      `serial_in' was updated with this case:
      
      +	case UPIO_MEM32:
      +		return readl(up->port.membase + offset);
      +
      
      which made it produce results outside the unsigned 8-bit range for the
      first time, though obviously it is system dependent what actual values
      appear in the high order bits retrieved and it may well have been zeros
      in the relevant positions with the system the change originally was
      intended for.  It is at that point that code in `autoconfig' should have
      been updated accordingly, but clearly it was overlooked.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Cc: stable@vger.kernel.org # v2.6.12+
      Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
      Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
      Link: https://lore.kernel.org/r/alpine.DEB.2.21.2106260516220.37803@angie.orcam.me.uk
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      bcc84b74
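The arithmetic in the 8250 commit above can be checked with a small sketch. It contrasts shifting the full 32-bit bus value before truncating (the behaviour described as broken) with truncating to 8 bits first (the described fix). The function names are hypothetical; only the value 0x1f0009c1 and the shift by 6 come from the commit message.

```c
#include <stdint.h>

/* Shift the full 32-bit bus value, then truncate to 8 bits. */
static uint8_t fifo_bits_shift_first(uint32_t raw_iir)
{
    return (uint8_t)(raw_iir >> 6);
}

/* Truncate to the 8 register bits first, then shift. */
static uint8_t fifo_bits_mask_first(uint32_t raw_iir)
{
    uint8_t scratch = (uint8_t)raw_iir;   /* drops the floating high bits */

    return scratch >> 6;
}
```

For the raw value 0x1f0009c1, the first function yields 0x27 and the second yields 3, matching the numbers quoted in the commit message.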
    • T
      ext4: fix potential htree corruption when growing large_dir directories · 912fa436
      Theodore Ts'o authored
      stable inclusion
      from linux-4.19.203
      commit bc1954aa8a7e195ebd686a77e81c11863ce8edbf
      
      --------------------------------
      
      commit 877ba3f7 upstream.
      
      Commit b5776e75 ("ext4: fix potential htree index checksum
      corruption") removed a required restart when multiple levels of index
      nodes need to be split.  Fix this to avoid directory htree corruptions
      when using the large_dir feature.
      
      Cc: stable@kernel.org # v5.11
      Cc: Благодаренко Артём <artem.blagodarenko@gmail.com>
      Fixes: b5776e75 ("ext4: fix potential htree index checksum corruption")
      Reported-by: Denis <denis@voxelsoft.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      912fa436
    • A
      pipe: increase minimum default pipe size to 2 pages · 29e00e56
      Alex Xu (Hello71) authored
      stable inclusion
      from linux-4.19.203
      commit 76ccb26c5312760113b2b3ef6de307474e8d4b45
      
      --------------------------------
      
      commit 46c4c9d1 upstream.
      
      This program always prints 4096 and hangs before the patch, and always
      prints 8192 and exits successfully after:
      
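        #define _GNU_SOURCE   /* for F_GETPIPE_SZ */
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        int main()
        {
            int pipefd[2];
            /* Create many pipes; only the last pair is used below. */
            for (int i = 0; i < 1025; i++)
                if (pipe(pipefd) == -1)
                    return 1;
            size_t bufsz = fcntl(pipefd[1], F_GETPIPE_SZ);
            printf("%zu\n", bufsz);
            char *buf = calloc(bufsz, 1);
            write(pipefd[1], buf, bufsz);
            read(pipefd[0], buf, bufsz-1);
            write(pipefd[1], buf, 1);
        }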
      
      Note that you may need to increase your RLIMIT_NOFILE before running the
      program.
      
      Fixes: 759c0114 ("pipe: limit the per-user amount of pages allocated in pipes")
      Cc: <stable@vger.kernel.org>
      Link: https://lore.kernel.org/lkml/1628086770.5rn8p04n6j.none@localhost/
      Link: https://lore.kernel.org/lkml/1628127094.lxxn016tj7.none@localhost/
      Signed-off-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      29e00e56
    • S
      tracing/histogram: Rename "cpu" to "common_cpu" · 237ce365
      Steven Rostedt (VMware) authored
      stable inclusion
      from linux-4.19.203
      commit c0add455ae992b53b0d52cd4d8682528b4014c42
      
      --------------------------------
      
      commit 1e3bac71 upstream.
      
      Currently the histogram logic allows the user to write "cpu" in as an
      event field, and it will record the CPU that the event happened on.
      
      The problem with this is that there's a lot of events that have "cpu"
      as a real field, and using "cpu" as the CPU it ran on, makes it
      impossible to run histograms on the "cpu" field of events.
      
      For example, if I want to have a histogram on the count of the
      workqueue_queue_work event on its cpu field, running:
      
       ># echo 'hist:keys=cpu' > events/workqueue/workqueue_queue_work/trigger
      
      Gives a misleading and wrong result.
      
      Change the command to "common_cpu" as no event should have "common_*"
      fields as that's a reserved name for fields used by all events. And
      this makes sense here as common_cpu would be a field used by all events.
      
      Now we can even do:
      
       ># echo 'hist:keys=common_cpu,cpu if cpu < 100' > events/workqueue/workqueue_queue_work/trigger
       ># cat events/workqueue/workqueue_queue_work/hist
       # event histogram
       #
       # trigger info: hist:keys=common_cpu,cpu:vals=hitcount:sort=hitcount:size=2048 if cpu < 100 [active]
       #
      
       { common_cpu:          0, cpu:          2 } hitcount:          1
       { common_cpu:          0, cpu:          4 } hitcount:          1
       { common_cpu:          7, cpu:          7 } hitcount:          1
       { common_cpu:          0, cpu:          7 } hitcount:          1
       { common_cpu:          0, cpu:          1 } hitcount:          1
       { common_cpu:          0, cpu:          6 } hitcount:          2
       { common_cpu:          0, cpu:          5 } hitcount:          2
       { common_cpu:          1, cpu:          1 } hitcount:          4
       { common_cpu:          6, cpu:          6 } hitcount:          4
       { common_cpu:          5, cpu:          5 } hitcount:         14
       { common_cpu:          4, cpu:          4 } hitcount:         26
       { common_cpu:          0, cpu:          0 } hitcount:         39
       { common_cpu:          2, cpu:          2 } hitcount:        184
      
      Now for backward compatibility, I added a trick. If "cpu" is used, and
      the field is not found, it will fall back to "common_cpu" and work as
      it did before. This way, it will still work for old programs that use
      "cpu" to get the actual CPU, but if the event has a "cpu" as a field, it
      will get that event's "cpu" field, which is probably what it wants
      anyway.
      
      I updated the tracefs/README to include documentation about both the
      common_timestamp and the common_cpu. This way, if that text is present in
      the README, then an application can know that common_cpu is supported over
      just plain "cpu".
      
      Link: https://lkml.kernel.org/r/20210721110053.26b4f641@oasis.local.home
      
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Fixes: 8b7622bf ("tracing: Add cpu field for hist triggers")
      Reviewed-by: Tom Zanussi <zanussi@kernel.org>
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      237ce365
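The backward-compatibility rule in the "common_cpu" commit above amounts to a lookup order: an explicit "common_cpu" always means the executing CPU, a real event field wins over the legacy meaning of "cpu", and "cpu" falls back to the executing CPU only when the event has no such field. A minimal sketch of that order, with hypothetical names (`enum key_kind`, `resolve_key()`) that do not come from the tracing code:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result of resolving a histogram key name. */
enum key_kind { KEY_EVENT_FIELD, KEY_COMMON_CPU, KEY_UNKNOWN };

/* Returns 1 when 'name' is one of the event's own fields. */
static int event_has_field(const char *const *fields, size_t n,
                           const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(fields[i], name) == 0)
            return 1;
    return 0;
}

static enum key_kind resolve_key(const char *const *fields, size_t n,
                                 const char *name)
{
    if (strcmp(name, "common_cpu") == 0)
        return KEY_COMMON_CPU;        /* reserved name, CPU the event ran on */
    if (event_has_field(fields, n, name))
        return KEY_EVENT_FIELD;       /* the event's own field wins */
    if (strcmp(name, "cpu") == 0)
        return KEY_COMMON_CPU;        /* legacy fallback for old programs */
    return KEY_UNKNOWN;
}
```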
    • S
      tracing / histogram: Give calculation hist_fields a size · f534ddcc
      Steven Rostedt (VMware) authored
      stable inclusion
      from linux-4.19.203
      commit 43cba13ff1e793c0e1e1e317c951dea63710290e
      
      --------------------------------
      
      commit 2c05caa7 upstream.
      
      When working on my user space applications, I found a bug in the synthetic
      event code where the automated synthetic event field was not matching the
      event field calculation it was attached to. Looking deeper into it, it was
      because the calculation hist_field was not given a size.
      
      The synthetic event fields are matched to their hist_fields either by
      having the field have an identical string type, or if that does not match,
      then the size and signed values are used to match the fields.
      
      The problem arose when I tried to match a calculation where the fields
      were "unsigned int". My tool created a synthetic event of type "u32". But
      it failed to match. The string was:
      
        diff=field1-field2:onmatch(event).trace(synth,$diff)
      
      Adding debugging into the kernel, I found that the size of "diff" was 0.
      And since it was given "unsigned int" as a type, the histogram fallback
      code used size and signed. The signed matched, but the size of u32 (4) did
      not match zero, and the event failed to be created.
      
      This can be worse if the field you want to match is not one of the
      acceptable fields for a synthetic event. As event fields can have any type
      that is supported in Linux, this can cause an issue. For example, if a
      type is an enum. Then there's no way to use that with any calculations.
      
      Have the calculation field simply take on the size of what it is
      calculating.
      
      Link: https://lkml.kernel.org/r/20210730171951.59c7743f@oasis.local.home
      
      Cc: Tom Zanussi <zanussi@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Fixes: 100719dc ("tracing: Add simple expression support to hist triggers")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      f534ddcc
    • Y
      blk-iolatency: error out if blk_get_queue() failed in iolatency_set_limit() · af53f3e4
      Yu Kuai authored
      stable inclusion
      from linux-4.19.203
      commit 76ab02d9b861da0785176f0228340f22023902fa
      
      --------------------------------
      
      [ Upstream commit 8d75d0ef ]
      
      If the queue is dying while iolatency_set_limit() is in progress,
      blk_get_queue() won't increment the refcount of the queue. However,
      blk_put_queue() will still decrement the refcount later, which will
      cause the refcount to be unbalanced.
      
      Thus error out in such a case to fix the problem.
      
      Fixes: 8c772a9b ("blk-iolatency: fix IO hang due to negative inflight counter")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20210805124645.543797-1-yukuai3@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      af53f3e4
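The refcount imbalance described in the blk-iolatency commit above follows a general pattern: a "get" that can fail must be checked before the matching "put" is allowed to run. The sketch below models that pattern in user space. `struct queue`, `queue_get()`, `queue_put()` and `set_limit()` are hypothetical stand-ins, and the choice of `-ENODEV` as the error code is an assumption made for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified queue with a reference count and a dying flag. */
struct queue {
    int  refcount;
    bool dying;
};

/* Takes a reference unless the queue is dying; reports whether it did. */
static bool queue_get(struct queue *q)
{
    if (q->dying)
        return false;
    q->refcount++;
    return true;
}

static void queue_put(struct queue *q)
{
    q->refcount--;
}

static int set_limit(struct queue *q)
{
    /* Error out early: no reference was taken, so none may be dropped. */
    if (!queue_get(q))
        return -ENODEV;

    /* ... update the limit while holding the reference ... */

    queue_put(q);
    return 0;
}
```

In both the success and the failure path the reference count ends where it started, which is the invariant the commit restores.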
    • P
      net: Fix zero-copy head len calculation. · bbba8eec
      Pravin B Shelar authored
      stable inclusion
      from linux-4.19.202
      commit a66fdcda469a0e103fe105dc0c95536fa28dc733
      
      --------------------------------
      
      [ Upstream commit a17ad096 ]
      
      In some cases the skb head could be locked and the entire header
      data is pulled from the skb. When skb_zerocopy() is called in such cases,
      the following BUG is triggered. This patch fixes it by copying the entire
      skb in such cases.
      This could be optimized in case it is a performance bottleneck.
      
      ---8<---
      kernel BUG at net/core/skbuff.c:2961!
      invalid opcode: 0000 [#1] SMP PTI
      CPU: 2 PID: 0 Comm: swapper/2 Tainted: G           OE     5.4.0-77-generic #86-Ubuntu
      Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.13.0-1ubuntu1.1 04/01/2014
      RIP: 0010:skb_zerocopy+0x37a/0x3a0
      RSP: 0018:ffffbcc70013ca38 EFLAGS: 00010246
      Call Trace:
       <IRQ>
       queue_userspace_packet+0x2af/0x5e0 [openvswitch]
       ovs_dp_upcall+0x3d/0x60 [openvswitch]
       ovs_dp_process_packet+0x125/0x150 [openvswitch]
       ovs_vport_receive+0x77/0xd0 [openvswitch]
       netdev_port_receive+0x87/0x130 [openvswitch]
       netdev_frame_hook+0x4b/0x60 [openvswitch]
       __netif_receive_skb_core+0x2b4/0xc90
       __netif_receive_skb_one_core+0x3f/0xa0
       __netif_receive_skb+0x18/0x60
       process_backlog+0xa9/0x160
       net_rx_action+0x142/0x390
       __do_softirq+0xe1/0x2d6
       irq_exit+0xae/0xb0
       do_IRQ+0x5a/0xf0
       common_interrupt+0xf/0xf
      
      Code that triggered BUG:
      int
      skb_zerocopy(struct sk_buff *to, struct sk_buff *from, int len, int hlen)
      {
              int i, j = 0;
              int plen = 0; /* length of skb->head fragment */
              int ret;
              struct page *page;
              unsigned int offset;
      
              BUG_ON(!from->head_frag && !hlen);
      
      Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      bbba8eec
    • P
      netfilter: nft_nat: allow to specify layer 4 protocol NAT only · 1a53a31c
      Pablo Neira Ayuso authored
      stable inclusion
      from linux-4.19.201
      commit 1cb5995a39eb3dc97a7539d00d2c82be030e0bb8
      
      --------------------------------
      
      [ Upstream commit a33f387e ]
      
      nft_nat reports a bogus EAFNOSUPPORT if no layer 3 information is specified.
      
      Fixes: d07db988 ("netfilter: nf_tables: introduce nft_validate_register_load()")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1a53a31c
    • F
      netfilter: conntrack: adjust stop timestamp to real expiry value · cbc95e28
      Florian Westphal authored
      stable inclusion
      from linux-4.19.201
      commit 512fd52e2091560de66da26799b3f1ca7ca1d41b
      
      --------------------------------
      
      [ Upstream commit 30a56a2b ]
      
      In case the entry is evicted via garbage collection there is
      a delay between the timeout value and the eviction event.
      
      This adjusts the stop value based on how much time has passed.
      
      Fixes: b87a2f91 ("netfilter: conntrack: add gc worker to remove timed-out entries")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      cbc95e28
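The adjustment described in the conntrack commit above can be illustrated with plain arithmetic: if an entry is evicted some time after its timeout, the recorded stop time should be moved back by the overdue amount. This is a sketch under stated assumptions, not the kernel code; `stop_timestamp_ns()` and its parameters are hypothetical, with a negative `remaining_ns` meaning the entry expired that long ago.

```c
#include <stdint.h>

/*
 * now_ns:       time at which the eviction is observed
 * remaining_ns: time left until expiry; negative when already expired
 * Returns the stop timestamp, moved back to the real expiry when overdue.
 */
static uint64_t stop_timestamp_ns(uint64_t now_ns, int64_t remaining_ns)
{
    if (remaining_ns < 0) {
        /* Unsigned negation avoids overflow for the most negative value. */
        uint64_t overdue = (uint64_t)0 - (uint64_t)remaining_ns;

        return overdue < now_ns ? now_ns - overdue : 0;
    }
    return now_ns;
}
```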
    • E
      virtio_net: Do not pull payload in skb->head · caee2338
      Eric Dumazet authored
      stable inclusion
      from linux-4.19.201
      commit 16851e34b621bc7e652c508bb28c47948fb86958
      
      --------------------------------
      
      commit 0f6925b3 upstream.
      
      Xuan Zhuo reported that commit 3226b158 ("net: avoid 32 x truesize
      under-estimation for tiny skbs") brought a ~10% performance drop.
      
      The reason for the performance drop was that GRO was forced
      to chain sk_buff (using skb_shinfo(skb)->frag_list), which
      uses more memory but also causes packet consumers to incur
      a lot of overhead handling all the tiny skbs.
      
      It turns out that virtio_net page_to_skb() has a wrong strategy:
      It allocates skbs with GOOD_COPY_LEN (128) bytes in skb->head, then
      copies 128 bytes from the page, before feeding the packet to GRO stack.
      
      This was suboptimal before commit 3226b158 ("net: avoid 32 x truesize
      under-estimation for tiny skbs") because GRO was using 2 frags per MSS,
      meaning we were not packing MSS with 100% efficiency.
      
      The fix is to pull only the Ethernet header in page_to_skb().
      
      Then, we change virtio_net_hdr_to_skb() to pull the missing
      headers, instead of assuming they were already pulled by callers.
      
      This fixes the performance regression, but could also allow virtio_net
      to accept packets with more than 128 bytes of headers.
      
      Many thanks to Xuan Zhuo for his report, and his tests/help.
      
      Fixes: 3226b158 ("net: avoid 32 x truesize under-estimation for tiny skbs")
      Reported-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Link: https://www.spinics.net/lists/netdev/msg731397.html
      Co-Developed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: virtualization@lists.linux-foundation.org
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      caee2338
    • Y
      virtio_net: Add XDP meta data support · 69073eb1
      Yuya Kusakabe authored
      stable inclusion
      from linux-4.19.187
      commit 15f135b4ea6e31215d184ef26d0bbb44e1cbe9f5
      
      --------------------------------
      
      [ Upstream commit 503d539a ]
      
      Implement support for transferring XDP meta data into skb for
      virtio_net driver; before calling into the program, xdp.data_meta points
      to xdp.data, where on program return with pass verdict, we call
      into skb_metadata_set().
      
      Tested with the script at
      https://github.com/higebu/virtio_net-xdp-metadata-test.
      
      Signed-off-by: Yuya Kusakabe <yuya.kusakabe@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Link: https://lore.kernel.org/bpf/20200225033212.437563-2-yuya.kusakabe@gmail.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      69073eb1
    • W
      net: check untrusted gso_size at kernel entry · dde45c86
      Willem de Bruijn authored
      stable inclusion
      from linux-4.19.128
      commit 8920e8ae16a89bebd4d98ec6c7b306b1e3e06722
      
      --------------------------------
      
      [ Upstream commit 6dd912f8 ]
      
      Syzkaller again found a path to a kernel crash through bad gso input:
      a packet with gso size exceeding len.
      
      These packets are dropped in tcp_gso_segment and udp[46]_ufo_fragment.
      But they may affect gso size calculations earlier in the path.
      
      Now that we have thlen as of commit 9274124f ("net: stricter
      validation of untrusted gso packets"), check gso_size at entry too.
      
      Fixes: bfd5f4a3 ("packet: Add GSO/csum offload support.")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      dde45c86
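The gso_size commit above describes rejecting input where the claimed segment size exceeds what the packet can hold. A minimal sketch of such an entry check follows. It is not the kernel implementation; `gso_size_is_plausible()` and its parameters are hypothetical, and the rule "payload must be larger than one segment" is a simplifying assumption about what counts as a genuine GSO packet.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * pkt_len:  total packet length in bytes
 * hdr_len:  bytes of headers preceding the payload
 * gso_size: untrusted per-segment payload size supplied by the sender
 *
 * Treat the packet as GSO only when the payload really spans more than
 * one segment of the claimed size.
 */
static bool gso_size_is_plausible(uint32_t pkt_len, uint32_t hdr_len,
                                  uint16_t gso_size)
{
    if (gso_size == 0 || hdr_len > pkt_len)
        return false;
    return pkt_len - hdr_len > gso_size;
}
```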
    • X
      sctp: move 198 addresses from unusable to private scope · c0a17e35
      Xin Long authored
      stable inclusion
      from linux-4.19.200
      commit 53012dd6ca2f3c9420b5cc447279375a90290fb4
      
      --------------------------------
      
      [ Upstream commit 1d11fa23 ]
      
      The draft draft-stewart-tsvwg-sctp-ipv4-00, which restricts 198 addresses,
      was never published. As private addresses, these should be allowed for
      use in SCTP.
      
      As Michael Tuexen suggested, this patch moves 198 addresses from
      unusable to private scope.
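
The reclassification can be illustrated with a small user-space scope classifier. This is a sketch under stated assumptions, not the SCTP implementation: the enum and function names are hypothetical, addresses are taken in host byte order, and "198 addresses" is assumed to mean the 198.18.0.0/15 benchmarking range.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical scope levels, loosely modelled on the ones named above. */
enum addr_scope {
    SCOPE_UNUSABLE,
    SCOPE_LOOPBACK,
    SCOPE_LINK,
    SCOPE_PRIVATE,
    SCOPE_GLOBAL
};

/* True if addr (host byte order) lies inside prefix/mask. */
static bool in_prefix(uint32_t addr, uint32_t prefix, uint32_t mask)
{
    return (addr & mask) == prefix;
}

/* Classify an IPv4 address given in host byte order. */
enum addr_scope ipv4_scope(uint32_t a)
{
    if (a == 0xffffffffu ||                          /* broadcast   */
        in_prefix(a, 0xe0000000u, 0xf0000000u) ||    /* 224.0.0.0/4 */
        in_prefix(a, 0x00000000u, 0xff000000u))      /* 0.0.0.0/8   */
        return SCOPE_UNUSABLE;
    if (in_prefix(a, 0x7f000000u, 0xff000000u))      /* 127.0.0.0/8 */
        return SCOPE_LOOPBACK;
    if (in_prefix(a, 0xa9fe0000u, 0xffff0000u))      /* 169.254.0.0/16 */
        return SCOPE_LINK;
    if (in_prefix(a, 0x0a000000u, 0xff000000u) ||    /* 10.0.0.0/8     */
        in_prefix(a, 0xac100000u, 0xfff00000u) ||    /* 172.16.0.0/12  */
        in_prefix(a, 0xc0a80000u, 0xffff0000u) ||    /* 192.168.0.0/16 */
        in_prefix(a, 0xc6120000u, 0xfffe0000u))      /* 198.18.0.0/15, assumed range */
        return SCOPE_PRIVATE;
    return SCOPE_GLOBAL;
}
```

With this layout, moving a range between scopes is a matter of moving one `in_prefix()` test from the unusable branch to the private branch.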
      
      Reported-by: Sérgio <surkamp@gmail.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • E
      net: annotate data race around sk_ll_usec · f785ece9
      Eric Dumazet committed
      stable inclusion
      from linux-4.19.200
      commit c1a5cd807960d07381364c7b05aa3a43eb6d3a2f
      
      --------------------------------
      
      [ Upstream commit 0dbffbb5 ]
      
      sk_ll_usec is read locklessly from sk_can_busy_loop()
      while another thread can change its value in sock_setsockopt().
      
      This is correct but needs annotations.
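
As a rough user-space illustration of what such annotations look like, the sketch below marks both sides of the lockless access. The macros are simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE(), written with the GCC/Clang `__typeof__` extension; the struct and function names are hypothetical.

```c
/* Simplified stand-ins for the kernel macros (GCC/Clang extension). */
#define READ_ONCE(x)      (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)  (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical model of a socket holding a busy-poll timeout. */
struct sock_model {
    unsigned int ll_usec;
};

/* Writer side: models the setsockopt path updating the value. */
void set_busy_poll(struct sock_model *sk, unsigned int usec)
{
    WRITE_ONCE(sk->ll_usec, usec);
}

/* Reader side: models the lockless check on the receive path. */
int can_busy_loop(struct sock_model *sk)
{
    return READ_ONCE(sk->ll_usec) != 0;
}
```

The annotations do not add locking. They document that the race is intentional and keep the compiler from tearing, fusing, or repeating the access.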
      
      BUG: KCSAN: data-race in __skb_try_recv_datagram / sock_setsockopt
      
      write to 0xffff88814eb5f904 of 4 bytes by task 14011 on cpu 0:
       sock_setsockopt+0x1287/0x2090 net/core/sock.c:1175
       __sys_setsockopt+0x14f/0x200 net/socket.c:2100
       __do_sys_setsockopt net/socket.c:2115 [inline]
       __se_sys_setsockopt net/socket.c:2112 [inline]
       __x64_sys_setsockopt+0x62/0x70 net/socket.c:2112
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff88814eb5f904 of 4 bytes by task 14001 on cpu 1:
       sk_can_busy_loop include/net/busy_poll.h:41 [inline]
       __skb_try_recv_datagram+0x14f/0x320 net/core/datagram.c:273
       unix_dgram_recvmsg+0x14c/0x870 net/unix/af_unix.c:2101
       unix_seqpacket_recvmsg+0x5a/0x70 net/unix/af_unix.c:2067
       ____sys_recvmsg+0x15d/0x310 include/linux/uio.h:244
       ___sys_recvmsg net/socket.c:2598 [inline]
       do_recvmmsg+0x35c/0x9f0 net/socket.c:2692
       __sys_recvmmsg net/socket.c:2771 [inline]
       __do_sys_recvmmsg net/socket.c:2794 [inline]
       __se_sys_recvmmsg net/socket.c:2787 [inline]
       __x64_sys_recvmmsg+0xcf/0x150 net/socket.c:2787
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x00000000 -> 0x00000101
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 14001 Comm: syz-executor.3 Not tainted 5.13.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • Y
      net/802/garp: fix memleak in garp_request_join() · 96a28d8d
      Yang Yingliang committed
      stable inclusion
      from linux-4.19.200
      commit e954107513e5e984821591b9b0ee4b002fcb63c6
      
      --------------------------------
      
      [ Upstream commit 42ca63f9 ]
      
      I got a kmemleak report when doing a fuzz test:
      
      BUG: memory leak
      unreferenced object 0xffff88810c909b80 (size 64):
        comm "syz", pid 957, jiffies 4295220394 (age 399.090s)
        hex dump (first 32 bytes):
          01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 08 00 00 00 01 02 00 04  ................
        backtrace:
          [<00000000ca1f2e2e>] garp_request_join+0x285/0x3d0
          [<00000000bf153351>] vlan_gvrp_request_join+0x15b/0x190
          [<0000000024005e72>] vlan_dev_open+0x706/0x980
          [<00000000dc20c4d4>] __dev_open+0x2bb/0x460
          [<0000000066573004>] __dev_change_flags+0x501/0x650
          [<0000000035b42f83>] rtnl_configure_link+0xee/0x280
          [<00000000a5e69de0>] __rtnl_newlink+0xed5/0x1550
          [<00000000a5258f4a>] rtnl_newlink+0x66/0x90
          [<00000000506568ee>] rtnetlink_rcv_msg+0x439/0xbd0
          [<00000000b7eaeae1>] netlink_rcv_skb+0x14d/0x420
          [<00000000c373ce66>] netlink_unicast+0x550/0x750
          [<00000000ec74ce74>] netlink_sendmsg+0x88b/0xda0
          [<00000000381ff246>] sock_sendmsg+0xc9/0x120
          [<000000008f6a2db3>] ____sys_sendmsg+0x6e8/0x820
          [<000000008d9c1735>] ___sys_sendmsg+0x145/0x1c0
          [<00000000aa39dd8b>] __sys_sendmsg+0xfe/0x1d0
      
      When garp_request_leave() is called after garp_request_join(),
      attr->state is set to GARP_APPLICANT_VO, so garp_attr_destroy() is not
      called on the last transmit event in garp_uninit_applicant() and the
      applicant's attr is leaked. To fix this leak, iterate over and free each
      attr of the applicant before returning from garp_uninit_applicant().
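
The shape of the fix, freeing every remaining attribute at teardown regardless of its state, can be sketched in user space as follows. This is a simplified model with hypothetical names; it uses a singly linked list for brevity and does not reproduce the kernel's actual container or locking.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical attribute record owned by an applicant. */
struct attr {
    struct attr *next;
    int state;
};

/* Hypothetical applicant holding a list of attributes. */
struct applicant {
    struct attr *attrs;
};

/* Allocate an attribute and link it in; returns 0 on success, -1 on failure. */
int applicant_add_attr(struct applicant *app, int state)
{
    struct attr *a = malloc(sizeof(*a));

    if (!a)
        return -1;
    a->state = state;
    a->next = app->attrs;
    app->attrs = a;
    return 0;
}

/* Teardown: free every attribute, whatever its state; returns the count freed. */
size_t applicant_destroy_attrs(struct applicant *app)
{
    size_t freed = 0;
    struct attr *a = app->attrs;

    while (a) {
        struct attr *next = a->next;  /* save before freeing */

        free(a);
        freed++;
        a = next;
    }
    app->attrs = NULL;
    return freed;
}
```

Because the teardown loop does not look at `state`, an attribute left behind by an earlier leave request is still released.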
      
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • Y
      net/802/mrp: fix memleak in mrp_request_join() · 4c2871d4
      Yang Yingliang committed
      stable inclusion
      from linux-4.19.200
      commit f9dd1e4e9d39e799fbe2be9ac7e6b43a9567ff8c
      
      --------------------------------
      
      [ Upstream commit 996af621 ]
      
      I got a kmemleak report when doing a fuzz test:
      
      BUG: memory leak
      unreferenced object 0xffff88810c239500 (size 64):
      comm "syz-executor940", pid 882, jiffies 4294712870 (age 14.631s)
      hex dump (first 32 bytes):
      01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      00 00 00 00 00 00 00 00 01 00 00 00 01 02 00 04 ................
      backtrace:
      [<00000000a323afa4>] slab_alloc_node mm/slub.c:2972 [inline]
      [<00000000a323afa4>] slab_alloc mm/slub.c:2980 [inline]
      [<00000000a323afa4>] __kmalloc+0x167/0x340 mm/slub.c:4130
      [<000000005034ca11>] kmalloc include/linux/slab.h:595 [inline]
      [<000000005034ca11>] mrp_attr_create net/802/mrp.c:276 [inline]
      [<000000005034ca11>] mrp_request_join+0x265/0x550 net/802/mrp.c:530
      [<00000000fcfd81f3>] vlan_mvrp_request_join+0x145/0x170 net/8021q/vlan_mvrp.c:40
      [<000000009258546e>] vlan_dev_open+0x477/0x890 net/8021q/vlan_dev.c:292
      [<0000000059acd82b>] __dev_open+0x281/0x410 net/core/dev.c:1609
      [<000000004e6dc695>] __dev_change_flags+0x424/0x560 net/core/dev.c:8767
      [<00000000471a09af>] rtnl_configure_link+0xd9/0x210 net/core/rtnetlink.c:3122
      [<0000000037a4672b>] __rtnl_newlink+0xe08/0x13e0 net/core/rtnetlink.c:3448
      [<000000008d5d0fda>] rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3488
      [<000000004882fe39>] rtnetlink_rcv_msg+0x369/0xa10 net/core/rtnetlink.c:5552
      [<00000000907e6c54>] netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2504
      [<00000000e7d7a8c4>] netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
      [<00000000e7d7a8c4>] netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1340
      [<00000000e0645d50>] netlink_sendmsg+0x78e/0xc90 net/netlink/af_netlink.c:1929
      [<00000000c24559b7>] sock_sendmsg_nosec net/socket.c:654 [inline]
      [<00000000c24559b7>] sock_sendmsg+0x139/0x170 net/socket.c:674
      [<00000000fc210bc2>] ____sys_sendmsg+0x658/0x7d0 net/socket.c:2350
      [<00000000be4577b5>] ___sys_sendmsg+0xf8/0x170 net/socket.c:2404
      
      When mrp_request_leave() is called after mrp_request_join(),
      attr->state is set to MRP_APPLICANT_VO, so mrp_attr_destroy() is not
      called on the last TX event in mrp_uninit_applicant() and the applicant's
      attr is leaked. To fix this leak, iterate over and free each attr of the
      applicant before returning from mrp_uninit_applicant().
      
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • M
      af_unix: fix garbage collect vs MSG_PEEK · faa64eb4
      Miklos Szeredi committed
      stable inclusion
      from linux-4.19.200
      commit 1dabafa9f61118b1377fde424d9a94bf8dbf2813
      
      --------------------------------
      
      commit cbcf0112 upstream.
      
      unix_gc() assumes that candidate sockets can never gain an external
      reference (i.e.  be installed into an fd) while the unix_gc_lock is
      held.  Except for MSG_PEEK this is guaranteed by modifying inflight
      count under the unix_gc_lock.
      
      MSG_PEEK does not touch any variable protected by unix_gc_lock (file
      count is not), yet it needs to be serialized with garbage collection.
      Do this by locking/unlocking unix_gc_lock:
      
       1) increment file count
      
       2) lock/unlock barrier to make sure incremented file count is visible
          to garbage collection
      
       3) install file into fd
      
      This is a lock barrier (unlike smp_mb()) that ensures that garbage
      collection is run completely before or completely after the barrier.
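
The three steps above can be sketched in user space with a pthread mutex standing in for unix_gc_lock. This is a model under stated assumptions, not the af_unix code: the struct, the `installed` flag, and the function names are hypothetical.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Stand-in for unix_gc_lock; a collector would hold it for a whole run. */
static pthread_mutex_t gc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical model of a file with a reference count. */
struct file_model {
    atomic_int refcount;
    int installed;   /* models "visible through an fd" */
};

/*
 * Lock barrier: taking and releasing the lock cannot complete while a
 * collection run holds it, so the caller ends up ordered entirely
 * before or entirely after that run.
 */
void gc_lock_barrier(void)
{
    pthread_mutex_lock(&gc_lock);
    pthread_mutex_unlock(&gc_lock);
}

/* Models the MSG_PEEK path that hands out a new reference. */
void peek_install(struct file_model *f)
{
    atomic_fetch_add(&f->refcount, 1);  /* 1) increment file count       */
    gc_lock_barrier();                  /* 2) serialize with the collector */
    f->installed = 1;                   /* 3) install file into fd        */
}
```

A plain memory barrier would only order the accesses. The lock and unlock pair additionally waits out any collection run that is already in progress.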
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  2. 22 Oct, 2021 2 commits