1. 02 September 2020, 11 commits
    • mm: reclaim small amounts of memory when an external fragmentation event occurs · 9bcadc70
      Committed by Mel Gorman
      to #28825456
      
      commit 1c30844d2dfe272d58c8fc000960b835d13aa2ac upstream.
      
      An external fragmentation event was previously described as
      
          When the page allocator fragments memory, it records the event using
          the mm_page_alloc_extfrag event. If the fallback_order is smaller
          than a pageblock order (order-9 on 64-bit x86) then it's considered
          an event that will cause external fragmentation issues in the future.
      
      The kernel reduces the probability of such events by increasing the
      watermark sizes by calling set_recommended_min_free_kbytes early in the
      lifetime of the system.  This works reasonably well in general but if
      there are enough sparsely populated pageblocks then the problem can still
      occur as enough memory is free overall and kswapd stays asleep.
      
      This patch introduces a watermark_boost_factor sysctl that allows a zone
      watermark to be temporarily boosted when an external fragmentation-causing
      event occurs.  The boosting will stall allocations that would decrease
      free memory below the boosted low watermark, and kswapd is woken (if the
      calling context allows it) to reclaim an amount of memory relative to the
      size of the high watermark and the watermark_boost_factor until the boost
      is cleared.  When kswapd finishes, it wakes kcompactd at the pageblock order
      to clean some of the pageblocks that may have been affected by the
      fragmentation event.  kswapd avoids any writeback, slab shrinkage and swap
      from reclaim context during this operation to avoid excessive system
      disruption in the name of fragmentation avoidance.  Care is taken so that
      kswapd will do normal reclaim work if the system is really low on memory.
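
      As a rough illustration of how the factor relates to the boosted amount
      (a userspace sketch only, not the kernel implementation; it assumes the
      factor is expressed in 1/10000ths of the zone's high watermark, so the
      upstream default of 15000 would correspond to boosting by up to 150%):

      #include <stdio.h>

      int main(void)
      {
          unsigned long high_wmark_pages = 12800; /* example zone high watermark */
          unsigned long boost_factor = 15000;     /* vm.watermark_boost_factor */
          /* Maximum boost implied by the factor. */
          unsigned long max_boost = high_wmark_pages * boost_factor / 10000;

          printf("high watermark: %lu pages, max boost: %lu pages\n",
                 high_wmark_pages, max_boost);
          return 0;
      }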
      
      This was evaluated using the same workloads as "mm, page_alloc: Spread
      allocations across zones before introducing fragmentation".
      
      1-socket Skylake machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 1 THP allocating thread
      --------------------------------------
      
      4.20-rc3 extfrag events < order 9:   804694
      4.20-rc3+patch:                      408912 (49% reduction)
      4.20-rc3+patch1-4:                    18421 (98% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-1      653.58 (   0.00%)      652.71 (   0.13%)
      Amean     fault-huge-1        0.00 (   0.00%)      178.93 * -99.00%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1        0.00 (   0.00%)        5.12 ( 100.00%)
      
      Note that external fragmentation-causing events are massively reduced by
      this patch, whether in comparison to the previous kernel or the vanilla
      kernel.  The fault latency for huge pages appears to be increased but that
      is only because THP allocations were successful with the patch applied.
      
      1-socket Skylake machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  291392
      4.20-rc3+patch:                     191187 (34% reduction)
      4.20-rc3+patch1-4:                   13464 (95% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Min       fault-base-1      912.00 (   0.00%)      905.00 (   0.77%)
      Min       fault-huge-1      127.00 (   0.00%)      135.00 (  -6.30%)
      Amean     fault-base-1     1467.55 (   0.00%)     1481.67 (  -0.96%)
      Amean     fault-huge-1     1127.11 (   0.00%)     1063.88 *   5.61%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1       77.64 (   0.00%)       83.46 (   7.49%)
      
      As before, massive reduction in external fragmentation events, some jitter
      on latencies and an increase in THP allocation success rates.
      
      2-socket Haswell machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 5 THP allocating threads
      ----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  215698
      4.20-rc3+patch:                     200210 (7% reduction)
      4.20-rc3+patch1-4:                   14263 (93% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     1346.45 (   0.00%)     1306.87 (   2.94%)
      Amean     fault-huge-5     3418.60 (   0.00%)     1348.94 (  60.54%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5        0.78 (   0.00%)        7.91 ( 910.64%)
      
      There is a 93% reduction in fragmentation-causing events, a big reduction
      in the huge page fault latency, and a higher allocation success rate.
      
      2-socket Haswell machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9: 166352
      4.20-rc3+patch:                    147463 (11% reduction)
      4.20-rc3+patch1-4:                  11095 (93% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     6217.43 (   0.00%)     7419.67 * -19.34%*
      Amean     fault-huge-5     3163.33 (   0.00%)     3263.80 (  -3.18%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5       95.14 (   0.00%)       87.98 (  -7.53%)
      
      There is a large reduction in fragmentation events with some jitter around
      the latencies and success rates.  As before, the high THP allocation
      success rate does mean the system is under a lot of pressure.  However, as
      the fragmentation events are reduced, it would be expected that the
      long-term allocation success rate would be higher.
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      9bcadc70
    • filemap: kill page_cache_read usage in filemap_fault · d33d6167
      Committed by Josef Bacik
      to #28718400
      
      commit a75d4c33377277b6034dd1e2663bce444f952c14 upstream.
      
      Patch series "drop the mmap_sem when doing IO in the fault path", v6.
      
      Now that we have proper isolation in place with cgroups2 we have started
      going through and fixing the various priority inversions.  Most are
      gone now, but this one is sort of weird since it's not necessarily a
      priority inversion that happens within the kernel, but rather because of
      something userspace does.
      
      We have giant applications that we want to protect, and parts of these
      giant applications do things like watch the system state to determine how
      healthy the box is for load balancing and such.  This involves running
      'ps' or other such utilities.  These utilities will often walk
      /proc/<pid>/whatever, and these files can sometimes need to
      down_read(&task->mmap_sem).  Not usually a big deal, but we noticed when
      we are stress testing that sometimes our protected application has latency
      spikes trying to get the mmap_sem for tasks that are in lower priority
      cgroups.
      
      This is because any down_write() on a semaphore essentially turns it into
      a mutex, so even if we currently have it held for reading, any new readers
      will not be allowed in, to keep from starving the writer.  This is fine,
      except a lower priority task could be stuck doing IO because it has been
      throttled to the point that its IO is taking much longer than normal.  But
      because a higher priority group depends on this completing it is now stuck
      behind lower priority work.
      
      In order to avoid this particular priority inversion we want to use the
      existing retry mechanism to stop from holding the mmap_sem at all if we
      are going to do IO.  This already exists in the read case sort of, but
      needed to be extended for more than just grabbing the page lock.  With
      io.latency we throttle at submit_bio() time, so the readahead stuff can
      block and even page_cache_read can block, so all these paths need to have
      the mmap_sem dropped.
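
      The retry idea can be sketched in userspace roughly as follows (a toy
      model, not the kernel code: a pthread rwlock stands in for mmap_sem and
      a sleep stands in for throttled IO). The point is that the slow read is
      done after dropping the lock, and the caller simply retries:

      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>
      #include <unistd.h>

      enum fault_result { FAULT_DONE, FAULT_RETRY };

      static pthread_rwlock_t mmap_lock = PTHREAD_RWLOCK_INITIALIZER;
      static bool page_uptodate; /* pretend page-cache state */

      static enum fault_result handle_fault(void)
      {
          if (!page_uptodate) {
              pthread_rwlock_unlock(&mmap_lock); /* drop the lock before slow IO */
              usleep(1000);                      /* stand-in for throttled IO */
              page_uptodate = true;
              return FAULT_RETRY;                /* caller retakes the lock */
          }
          return FAULT_DONE;
      }

      int main(void)
      {
          enum fault_result res;

          do {
              pthread_rwlock_rdlock(&mmap_lock);
              res = handle_fault();
          } while (res == FAULT_RETRY);

          pthread_rwlock_unlock(&mmap_lock); /* the FAULT_DONE path still holds it */
          printf("fault completed without holding the lock across IO\n");
          return 0;
      }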
      
      The other big thing is ->page_mkwrite.  btrfs is particularly shitty here
      because we have to reserve space for the dirty page, which can be a very
      expensive operation.  We use the same retry method as the read path, and
      simply cache the page and verify the page is still setup properly the next
      pass through ->page_mkwrite().
      
      I've tested these patches with xfstests and there are no regressions.
      
      This patch (of 3):
      
      If we do not have a page at filemap_fault time we'll do this weird forced
      page_cache_read thing to populate the page, and then drop it again and
      loop around and find it.  This makes for 2 ways we can read a page in
      filemap_fault, and it's not really needed.  Instead add a FGP_FOR_MMAP
      flag so that pagecache_get_page() will return an unlocked page that's in
      pagecache.  Then use the normal page locking and readpage logic already in
      filemap_fault.  This simplifies the no page in page cache case
      significantly.
      
      [akpm@linux-foundation.org: fix comment text]
      [josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
        Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
      Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      Conflicts:
      	mm/filemap.c
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      d33d6167
    • arm_pmu: acpi: spe: Add initial MADT/SPE probing · f1c96d2b
      Committed by Jeremy Linton
      fix #26734090
      
      commit d24a0c7099b32b6981d7f126c45348e381718350 upstream
      
      ACPI 6.3 adds additional fields to the MADT GICC
      structure to describe SPE PPIs. We pick these out
      of the cached reference to the madt_gicc structure
      similarly to the core PMU code. We then create a platform
      device referring to the IRQ and let the user/module loader
      decide whether to load the SPE driver.
      Tested-by: Hanjun Guo <hanjun.guo@linaro.org>
      Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
      Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
      Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
      Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
      f1c96d2b
    • arm64: Relax ICC_PMR_EL1 accesses when ICC_CTLR_EL1.PMHE is clear · 625b8a72
      Committed by Marc Zyngier
      task #25552995
      
      commit f226650494c6aa87526d12135b7de8b8c074f3de upstream.
      
      The GICv3 architecture specification is incredibly misleading when it
      comes to PMR and the requirement for a DSB. It turns out that this DSB
      is only required if the CPU interface sends an Upstream Control
      message to the redistributor in order to update the RD's view of PMR.
      
      This message is only sent when ICC_CTLR_EL1.PMHE is set, which isn't
      the case in Linux. It can still be set from EL3, so some special care
      is required. But the upshot is that in the (hopefully large) majority
      of the cases, we can drop the DSB altogether.
      
      This relies on a new static key being set if the boot CPU has PMHE
      set. The drawback is that this static key has to be exported to
      modules.
      
      Cc: Will Deacon <will@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
      625b8a72
    • efi: Let architectures decide the flags that should be saved/restored · 1c2e583b
      Committed by Julien Thierry
      task #25552995
      
      commit 13b210ddf474d9f3368766008a89fe82a6f90b48 upstream
      
      Currently, irqflags are saved before calling runtime services and
      checked for mismatch on return.
      
      Provide a pair of overridable macros to save and restore (if needed) the
      state that needs to be preserved on return from a runtime service.
      This allows checking for flags that are not necessarily related to
      irqflags.
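
      The overridable-macro pattern being described can be shown with a small
      compilable toy (invented names, not the EFI code); an architecture header
      defining these macros before this point would override the defaults:

      #include <stdio.h>

      #ifndef arch_save_extra_flags
      #define arch_save_extra_flags(state)    do { (state) = 0; } while (0)
      #endif

      #ifndef arch_restore_extra_flags
      #define arch_restore_extra_flags(state) do { (void)(state); } while (0)
      #endif

      static void call_runtime_service(void)
      {
          unsigned long state;

          arch_save_extra_flags(state);
          /* ... the runtime service would run here and may disturb that state ... */
          arch_restore_extra_flags(state);
      }

      int main(void)
      {
          call_runtime_service();
          printf("runtime service wrapped with save/restore hooks\n");
          return 0;
      }
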
      Signed-off-by: Julien Thierry <julien.thierry@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: linux-efi@vger.kernel.org
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
      1c2e583b
    • irqdesc: Add domain handler for NMIs · d0cd7461
      Committed by Julien Thierry
      task #25552995
      
      commit 6e4933a006616343f66c4702dc4fc56bb25e7b02 upstream
      
      NMI handling code should be executed between calls to nmi_enter and
      nmi_exit.
      
      Add a separate domain handler to properly set up the NMI context when
      handling an interrupt requested as an NMI.
      Signed-off-by: Julien Thierry <julien.thierry@arm.com>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
      d0cd7461
    • genirq: Provide NMI handlers · 3035257e
      Committed by Julien Thierry
      task #25552995
      
      commit 2dcf1fbcad352baaa5f47b17e57c5743c8eedbad upstream
      
      Provide flow handlers that are NMI safe for interrupts and percpu_devid
      interrupts.
      Signed-off-by: Julien Thierry <julien.thierry@arm.com>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
      3035257e
    • genirq: Provide NMI management for percpu_devid interrupts · 3bd2aa7b
      Committed by Julien Thierry
      task #25552995
      
      commit 4b078c3f1a26487c39363089ba0d5c6b09f2a89f upstream
      
      Add support for percpu_devid interrupts treated as NMIs.
      
      Percpu_devid NMIs need to be set up/torn down on each CPU they target.
      
      The same restrictions as for global NMIs still apply for percpu_devid NMIs.
      Signed-off-by: Julien Thierry <julien.thierry@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
      
      [ Shile: fixed conflicts in include/linux/interrupt.h ]
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      3bd2aa7b
    • genirq: Provide basic NMI management for interrupt lines · 2e186fb5
      Committed by Julien Thierry
      task #25552995
      
      commit b525903c254dab2491410f0f23707691b7c2c317 upstream
      
      Add functionality to allocate interrupt lines that will deliver IRQs
      as Non-Maskable Interrupts. These allocations are only successful if
      the irqchip provides the necessary support and allows NMI delivery for the
      interrupt line.
      
      Interrupt lines allocated for NMI delivery must be enabled/disabled through
      enable_nmi/disable_nmi_nosync to keep their state consistent.
      
      To treat a PERCPU IRQ as NMI, the interrupt must not be shared nor threaded,
      the irqchip directly managing the IRQ must be the root irqchip and the
      irqchip cannot be behind a slow bus.
      Signed-off-by: Julien Thierry <julien.thierry@arm.com>
      Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
      2e186fb5
    • irqchip/gic: Unify GIC priority definitions · 8fdd8aec
      Committed by Julien Thierry
      task #25552995
      
      commit 2130b789b3ef6a518b9c9c6f245642620e2b0c0c upstream.
      
      LPIs use the same priority value as other GIC interrupts.
      
      Make the GIC default priority definition visible to ITS implementation
      and use this same definition for LPI priorities.
      Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Julien Thierry <julien.thierry@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
      8fdd8aec
    • iommu: Add fast hook for getting DMA domains · 73559ad3
      Committed by Robin Murphy
      fix #27432135
      
      commit 6af588fed39178c8e118fcf9cb6664e58a1fbe88 upstream
      
      While iommu_get_domain_for_dev() is the robust way for arbitrary IOMMU
      API callers to retrieve the domain pointer, for DMA ops domains it
      doesn't scale well for large systems and multi-queue devices, since the
      momentary refcount adjustment will lead to exclusive cacheline contention
      when multiple CPUs are operating in parallel on different mappings for
      the same device.
      
      In the case of DMA ops domains, however, this refcounting is actually
      unnecessary, since they already imply that the group exists and is
      managed by platform code and IOMMU internals (by virtue of
      iommu_group_get_for_dev()) such that a reference will already be held
      for the lifetime of the device. Thus we can avoid the bottleneck by
      providing a fast lookup specifically for the DMA code to retrieve the
      default domain it already knows it has set up - a simple read-only
      dereference plays much nicer with cache-coherency protocols.
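
      A userspace sketch of the contention argument (not the IOMMU code; names
      and numbers are made up): each lookup that bumps and drops a shared atomic
      refcount forces exclusive ownership of that cacheline, while a plain read
      of an already-installed pointer scales across CPUs:

      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdio.h>
      #include <time.h>

      #define THREADS 4
      #define ITERS   (1L << 22)

      static atomic_long refcount = 1;
      static long dummy_domain;
      static _Atomic(void *) default_domain = &dummy_domain;

      static void *refcounted_lookup(void *arg)
      {
          (void)arg;
          for (long i = 0; i < ITERS; i++) {
              atomic_fetch_add(&refcount, 1); /* exclusive cacheline access */
              atomic_fetch_sub(&refcount, 1);
          }
          return NULL;
      }

      static void *readonly_lookup(void *arg)
      {
          void *seen = NULL;

          (void)arg;
          for (long i = 0; i < ITERS; i++)
              seen = atomic_load_explicit(&default_domain, memory_order_relaxed);
          return seen;
      }

      static double run(void *(*fn)(void *))
      {
          pthread_t t[THREADS];
          struct timespec a, b;

          clock_gettime(CLOCK_MONOTONIC, &a);
          for (int i = 0; i < THREADS; i++)
              pthread_create(&t[i], NULL, fn, NULL);
          for (int i = 0; i < THREADS; i++)
              pthread_join(t[i], NULL);
          clock_gettime(CLOCK_MONOTONIC, &b);
          return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
      }

      int main(void)
      {
          printf("refcounted lookups: %.2fs\n", run(refcounted_lookup));
          printf("read-only lookups:  %.2fs\n", run(readonly_lookup));
          return 0;
      }
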
      Signed-off-by: Robin Murphy <robin.murphy@arm.com>
      Tested-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Acked-by: zou cao <zoucao@linux.alibaba.com>
      73559ad3
  2. 29 June 2020, 10 commits
  3. 24 June 2020, 3 commits
  4. 23 June 2020, 3 commits
    • alinux: sched: Add switch for scheduler_tick load tracking · bcaf8afd
      Committed by Yihao Wu
      to #28739709
      
      Assume a workload is composed of a massive number of short tasks. Then
      periodic load tracking is unnecessary, because load tracking is already
      guaranteed by the frequent sleeps and wake-ups.

      If these short tasks run in their own individual cgroups, the load
      tracking becomes extremely heavy.
      
      This patch adds a switch to bypass scheduler_tick load tracking, in
      order to reduce scheduler overhead, without sacrificing much balance
      in this scenario.
      
      Performance Tests:
      
      1) 1100+ tasks in their individual cgroups, on a 96-HT Skylake machine
      
      	sched overhead(each HT): 0.74% -> 0.48%
      
      	(This test's baseline is from the previous patch)
      
      2) sysbench-threads with 96 threads, running for 5min
      
      	latency_ms 95th: 63.07 -> 54.01
      
      Besides these, no regression is found on our test platform.
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
      bcaf8afd
    • alinux: sched: Add switch for update_blocked_averages · bb48b716
      Committed by Yihao Wu
      to #28739709
      
      Unless the workloads are IO-bound, update_blocked_averages doesn't help
      load balancing. This patch adds a switch to bypass update_blocked_averages
      if prior knowledge about the workloads indicates IO is negligible.
      
      Performance Tests:
      
      1) 1100+ tasks in their individual cgroups, on a 96-HT Skylake machine
      
      	sched overhead(each HT): 3.78% -> 0.74%
      
      2) cgroup-overhead benchmark in our sched-test suite on a 96-HT Skylake
      
      	overhead: 21.06 -> 18.08
      
      3) unixbench context1 with 96 threads running for 1min
      
      	Score: 15409.40 -> 16821.77
      
      Besides these, UnixBench shows some performance ups and downs, but
      generally the performance of UnixBench hasn't changed.
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
      bb48b716
    • alinux: mm: thp: add fast_cow switch · 56a432f5
      Committed by Yang Shi
      task #27327988
      
      The commit ("thp: change CoW semantics for anon-THP") rewrites the THP CoW
      page fault handler to allocate a base page only, but there are requests to
      keep the old behavior just in case.  So, introduce a new sysfs knob,
      fast_cow, to control the behavior; the default is the new behavior.
      Write 0 to the knob to switch to the old behavior.
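
      A trivial userspace helper for flipping the knob could look like the
      sketch below; since the exact sysfs path of fast_cow isn't spelled out
      here, the path is taken from the command line rather than assumed:

      #include <stdio.h>

      int main(int argc, char **argv)
      {
          FILE *f;

          if (argc != 2) {
              fprintf(stderr, "usage: %s <path-to-fast_cow-knob>\n", argv[0]);
              return 1;
          }
          f = fopen(argv[1], "w");
          if (!f) {
              perror("fopen");
              return 1;
          }
          fputs("0\n", f); /* 0 selects the old CoW behavior, 1 the new default */
          fclose(f);
          return 0;
      }
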
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      [ caspar: fix checkpatch.pl warnings ]
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      56a432f5
  5. 15 June 2020, 2 commits
    • block: Fix use-after-free issue accessing struct io_cq · fba123ba
      Committed by Sahitya Tummala
      task #28557799
      
      [ Upstream commit 30a2da7b7e225ef6c87a660419ea04d3cef3f6a7 ]
      
      There is a potential race between ioc_release_fn() and
      ioc_clear_queue(), as shown below, due to which the kernel
      crash below is observed. It can also result in a use-after-free
      issue.
      
      context#1:				context#2:
      ioc_release_fn()			__ioc_clear_queue() gets the same icq
      ->spin_lock(&ioc->lock);		->spin_lock(&ioc->lock);
      ->ioc_destroy_icq(icq);
        ->list_del_init(&icq->q_node);
        ->call_rcu(&icq->__rcu_head,
        	icq_free_icq_rcu);
      ->spin_unlock(&ioc->lock);
      					->ioc_destroy_icq(icq);
      					  ->hlist_del_init(&icq->ioc_node);
      					  This results into below crash as this memory
      					  is now used by icq->__rcu_head in context#1.
      					  There is a chance that icq could be free'd
      					  as well.
      
      22150.386550:   <6> Unable to handle kernel write to read-only memory
      at virtual address ffffffaa8d31ca50
      ...
      Call trace:
      22150.607350:   <2>  ioc_destroy_icq+0x44/0x110
      22150.611202:   <2>  ioc_clear_queue+0xac/0x148
      22150.615056:   <2>  blk_cleanup_queue+0x11c/0x1a0
      22150.619174:   <2>  __scsi_remove_device+0xdc/0x128
      22150.623465:   <2>  scsi_forget_host+0x2c/0x78
      22150.627315:   <2>  scsi_remove_host+0x7c/0x2a0
      22150.631257:   <2>  usb_stor_disconnect+0x74/0xc8
      22150.635371:   <2>  usb_unbind_interface+0xc8/0x278
      22150.639665:   <2>  device_release_driver_internal+0x198/0x250
      22150.644897:   <2>  device_release_driver+0x24/0x30
      22150.649176:   <2>  bus_remove_device+0xec/0x140
      22150.653204:   <2>  device_del+0x270/0x460
      22150.656712:   <2>  usb_disable_device+0x120/0x390
      22150.660918:   <2>  usb_disconnect+0xf4/0x2e0
      22150.664684:   <2>  hub_event+0xd70/0x17e8
      22150.668197:   <2>  process_one_work+0x210/0x480
      22150.672222:   <2>  worker_thread+0x32c/0x4c8
      
      Fix this by adding a new ICQ_DESTROYED flag in ioc_destroy_icq() to
      indicate that this icq has already been destroyed. Also, ensure
      __ioc_clear_queue() accesses the icq within rcu_read_lock/unlock so
      that the icq doesn't get freed while it is still in use.
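
      The shape of the fix can be sketched with plain pthreads (illustration
      only, not the block layer code): a "destroyed" flag set under the
      object's lock lets a second, racing teardown back off instead of
      operating on memory that is about to be reused:

      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>
      #include <stdlib.h>

      struct obj {
          pthread_mutex_t lock;
          bool destroyed; /* analogous to the new ICQ_DESTROYED flag */
          int *resource;
      };

      static void obj_destroy(struct obj *o)
      {
          pthread_mutex_lock(&o->lock);
          if (o->destroyed) { /* a second caller backs off */
              pthread_mutex_unlock(&o->lock);
              return;
          }
          o->destroyed = true;
          pthread_mutex_unlock(&o->lock);

          free(o->resource); /* teardown runs exactly once */
          o->resource = NULL;
      }

      static void *teardown_thread(void *arg)
      {
          obj_destroy(arg);
          return NULL;
      }

      int main(void)
      {
          struct obj o = { .lock = PTHREAD_MUTEX_INITIALIZER, .destroyed = false };
          pthread_t a, b;

          o.resource = malloc(sizeof(*o.resource));

          /* Two contexts race to destroy the same object, as in the trace above. */
          pthread_create(&a, NULL, teardown_thread, &o);
          pthread_create(&b, NULL, teardown_thread, &o);
          pthread_join(a, NULL);
          pthread_join(b, NULL);

          printf("resource freed exactly once\n");
          return 0;
      }
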
      Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
      Co-developed-by: Pradeep P V K <ppvk@codeaurora.org>
      Signed-off-by: Pradeep P V K <ppvk@codeaurora.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      fba123ba
    • block: fix an integer overflow in logical block size · 8b05616d
      Committed by Mikulas Patocka
      task #28557799
      
      commit ad6bf88a6c19a39fb3b0045d78ea880325dfcf15 upstream.
      
      Logical block size has type unsigned short.  That means that it can be at
      most 32768 (the largest power-of-two block size that fits).  However, there
      are architectures that can run with 64k pages
      (for example arm64) and on these architectures, it may be possible to
      create block devices with 64k block size.
      
      For example, on an architecture with 64k pages it is possible to create a
      device-mapper writecache device with a 64k block size.  Mount will then fail
      with this error, because it tries to read the superblock using a 2-sector
      access:
        device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536
        EXT4-fs (dm-0): unable to read superblock
      
      This patch changes the logical block size from unsigned short to unsigned
      int to avoid the overflow.
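
      The truncation is easy to demonstrate in isolation (an illustration, not
      code from the patch): a 64k block size simply does not fit in an
      unsigned short.

      #include <stdio.h>

      int main(void)
      {
          unsigned int block_size = 65536;                    /* 64k block size */
          unsigned short old_field = (unsigned short)block_size;
          unsigned int new_field = block_size;

          printf("unsigned short stores %u as %u\n", block_size, old_field);
          printf("unsigned int   stores %u as %u\n", block_size, new_field);
          return 0;
      }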
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      8b05616d
  6. 11 June 2020, 1 commit
    • alinux: blk-mq: remove QUEUE_FLAG_POLL from default MQ flags · 294d5fb2
      Committed by Joseph Qi
      fix #28528017
      
      In the case of a virtio-blk device, /sys/block/<device>/queue/io_poll
      shows 1 and the user can't disable it. virtio-blk doesn't actually
      support polling yet, so this confuses end users. The root cause is that
      MQ initialization sets the QUEUE_FLAG_POLL bit by default.
      
      This fix takes ideas from the following upstream commits:
      6544d229bf43 ("block: enable polling by default if a poll map is initalized")
      6e0de61107f0 ("blk-mq: remove QUEUE_FLAG_POLL from default MQ flags")
      Since we don't want to get HCTX_TYPE_POLL related logic involved, just
      check mq_ops->poll and then set QUEUE_FLAG_POLL.
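
      The gating can be modelled with a small toy (not the block layer code):
      the capability flag is only set when the driver actually provides a poll
      callback, so a virtio-blk-like driver without ->poll ends up reporting 0:

      #include <stdbool.h>
      #include <stdio.h>

      struct queue_ops {
          int (*poll)(void); /* NULL if the driver cannot poll */
      };

      struct queue {
          const struct queue_ops *ops;
          bool poll_enabled;
      };

      static void queue_init(struct queue *q, const struct queue_ops *ops)
      {
          q->ops = ops;
          /* Previously set unconditionally; now it follows ->poll. */
          q->poll_enabled = ops->poll != NULL;
      }

      int main(void)
      {
          static const struct queue_ops virtio_blk_like = { .poll = NULL };
          struct queue q;

          queue_init(&q, &virtio_blk_like);
          printf("io_poll advertised: %d\n", q.poll_enabled);
          return 0;
      }
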
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      294d5fb2
  7. 09 June 2020, 5 commits
  8. 05 June 2020, 2 commits
  9. 04 June 2020, 2 commits
  10. 28 May 2020, 1 commit