1. 30 10月, 2019 40 次提交
    • R
      iommu/io-pgtable-arm-v7s: Add support for non-strict mode · 5fb01cd3
      Robin Murphy 提交于
      commit b2dfeba654cb08db327d0ed4547b66c2f8fce997 upstream
      
      As for LPAE, it's simply a case of skipping the leaf invalidation for a
      regular unmap, and ensuring that the one in split_blk_unmap() is paired
      with an explicit sync ASAP rather than relying on one which might only
      eventually happen way down the line.
      Signed-off-by: NRobin Murphy <robin.murphy@arm.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>
      5fb01cd3
    • Z
      iommu/arm-smmu-v3: Add support for non-strict mode · 8bb6210c
      Zhen Lei 提交于
      commit 9662b99a19abccb0b7bfc91abb3fec1447c35bf0 upstream
      
      Now that io-pgtable knows how to dodge strict TLB maintenance, all
      that's left to do is bridge the gap between the IOMMU core requesting
      DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE for default domains, and showing the
      appropriate IO_PGTABLE_QUIRK_NON_STRICT flag to alloc_io_pgtable_ops().
      Signed-off-by: NZhen Lei <thunder.leizhen@huawei.com>
      [rm: convert to domain attribute, tweak commit message]
      Signed-off-by: NRobin Murphy <robin.murphy@arm.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>
      8bb6210c
    • Z
      iommu/io-pgtable-arm: Add support for non-strict mode · 2576a534
      Zhen Lei 提交于
      commit b6b65ca20bc93d14319f9b5cf98fd3c19a4244e3 upstream
      
      Non-strict mode is simply a case of skipping 'regular' leaf TLBIs, since
      the sync is already factored out into ops->iotlb_sync at the core API
      level. Non-leaf invalidations where we change the page table structure
      itself still have to be issued synchronously in order to maintain walk
      caches correctly.
      
      To save having to reason about it too much, make sure the invalidation
      in arm_lpae_split_blk_unmap() just performs its own unconditional sync
      to minimise the window in which we're technically violating the break-
      before-make requirement on a live mapping. This might work out redundant
      with an outer-level sync for strict unmaps, but we'll never be splitting
      blocks on a DMA fastpath anyway.
      Signed-off-by: NZhen Lei <thunder.leizhen@huawei.com>
      [rm: tweak comment, commit message, split_blk_unmap logic and barriers]
      Signed-off-by: NRobin Murphy <robin.murphy@arm.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>
      2576a534
    • Z
      iommu: Add "iommu.strict" command line option · 2757989e
      Zhen Lei 提交于
      commit 68a6efe86f6a16e25556a2aff40efad41097b486 upstream
      
      Add a generic command line option to enable lazy unmapping via IOVA
      flush queues, which will initally be suuported by iommu-dma. This echoes
      the semantics of "intel_iommu=strict" (albeit with the opposite default
      value), but in the driver-agnostic fashion of "iommu.passthrough".
      Signed-off-by: NZhen Lei <thunder.leizhen@huawei.com>
      [rm: move handling out of SMMUv3 driver, clean up documentation]
      Signed-off-by: NRobin Murphy <robin.murphy@arm.com>
      [will: dropped broken printk when parsing command-line option]
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>
      2757989e
    • Z
      iommu/dma: Add support for non-strict mode · c6801dcd
      Zhen Lei 提交于
      commit 2da274cdf998a1c12afa6b5975db2df1df01edf1 upstream
      
      With the flush queue infrastructure already abstracted into IOVA
      domains, hooking it up in iommu-dma is pretty simple. Since there is a
      degree of dependency on the IOMMU driver knowing what to do to play
      along, we key the whole thing off a domain attribute which will be set
      on default DMA ops domains to request non-strict invalidation. That way,
      drivers can indicate the appropriate support by acknowledging the
      attribute, and we can easily fall back to strict invalidation otherwise.
      
      The flush queue callback needs a handle on the iommu_domain which owns
      our cookie, so we have to add a pointer back to that, but neatly, that's
      also sufficient to indicate whether we're using a flush queue or not,
      and thus which way to release IOVAs. The only slight subtlety is
      switching __iommu_dma_unmap() from calling iommu_unmap() to explicit
      iommu_unmap_fast()/iommu_tlb_sync() so that we can elide the sync
      entirely in non-strict mode.
      Signed-off-by: NZhen Lei <thunder.leizhen@huawei.com>
      [rm: convert to domain attribute, tweak comments and commit message]
      Signed-off-by: NRobin Murphy <robin.murphy@arm.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>
      c6801dcd
    • Z
      iommu/arm-smmu-v3: Implement flush_iotlb_all hook · 73afcbcd
      Zhen Lei 提交于
      commit 07fdef34d2be6811f00c6f9e4e2a1483cf86696c upstream
      
      .flush_iotlb_all is currently stubbed to arm_smmu_iotlb_sync() since the
      only time it would ever need to actually do anything is for callers
      doing their own explicit batching, e.g.:
      
      	iommu_unmap_fast(domain, ...);
      	iommu_unmap_fast(domain, ...);
      	iommu_iotlb_flush_all(domain, ...);
      
      where since io-pgtable still issues the TLBI commands implicitly in the
      unmap instead of implementing .iotlb_range_add, the "flush" only needs
      to ensure completion of those already-in-flight invalidations.
      
      However, we're about to start using it in anger with flush queues, so
      let's get a proper implementation wired up.
      Signed-off-by: NZhen Lei <thunder.leizhen@huawei.com>
      Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
      [rm: document why it wasn't a bug]
      Signed-off-by: NRobin Murphy <robin.murphy@arm.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>
      73afcbcd
    • Z
      iommu/arm-smmu-v3: Avoid back-to-back CMD_SYNC operations · 39cf7f88
      Zhen Lei 提交于
      commit 901510ee32f7190902f6fe4affb463e5d86a804c upstream
      
      Putting adjacent CMD_SYNCs into the command queue is nonsensical, but
      can happen when multiple CPUs are inserting commands. Rather than leave
      the poor old hardware to chew through these operations, we can instead
      drop the subsequent SYNCs and poll for completion of the first. This
      has been shown to improve IO performance under pressure, where the
      number of SYNC operations reduces by about a third:
      
      	CMD_SYNCs reduced:	19542181
      	CMD_SYNCs total:	58098548	(include reduced)
      	CMDs total:		116197099	(TLBI:SYNC about 1:1)
      Signed-off-by: NZhen Lei <thunder.leizhen@huawei.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>
      39cf7f88
    • Z
      iommu/arm-smmu-v3: Fix unexpected CMD_SYNC timeout · 8c02158d
      Zhen Lei 提交于
      commit 0f02477d16980938a84aba8688a4e3a303306116 upstream
      
      The condition break condition of:
      
      	(int)(VAL - sync_idx) >= 0
      
      in the __arm_smmu_sync_poll_msi() polling loop requires that sync_idx
      must be increased monotonically according to the sequence of the CMDs in
      the cmdq.
      
      However, since the msidata is populated using atomic_inc_return_relaxed()
      before taking the command-queue spinlock, then the following scenario
      can occur:
      
      CPU0			CPU1
      msidata=0
      			msidata=1
      			insert cmd1
      insert cmd0
      			smmu execute cmd1
      smmu execute cmd0
      			poll timeout, because msidata=1 is overridden by
      			cmd0, that means VAL=0, sync_idx=1.
      
      This is not a functional problem, since the caller will eventually either
      timeout or exit due to another CMD_SYNC, however it's clearly not what
      the code is supposed to be doing. Fix it, by incrementing the sequence
      count with the command-queue lock held, allowing us to drop the atomic
      operations altogether.
      Signed-off-by: NZhen Lei <thunder.leizhen@huawei.com>
      [will: dropped the specialised cmd building routine for now]
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>
      8c02158d
    • R
      iommu/io-pgtable-arm: Fix race handling in split_blk_unmap() · 0c9e76ff
      Robin Murphy 提交于
      commit 85c7a0f1ef624ef58173ef52ea77780257bdfe04 upstream
      
      In removing the pagetable-wide lock, we gained the possibility of the
      vanishingly unlikely case where we have a race between two concurrent
      unmappers splitting the same block entry. The logic to handle this is
      fairly straightforward - whoever loses the race frees their partial
      next-level table and instead dereferences the winner's newly-installed
      entry in order to fall back to a regular unmap, which intentionally
      echoes the pre-existing case of recursively splitting a 1GB block down
      to 4KB pages by installing a full table of 2MB blocks first.
      
      Unfortunately, the chump who implemented that logic failed to update the
      condition check for that fallback, meaning that if said race occurs at
      the last level (where the loser's unmap_idx is valid) then the unmap
      won't actually happen. Fix that to properly account for both the race
      and recursive cases.
      
      Fixes: 2c3d273e ("iommu/io-pgtable-arm: Support lockless operation")
      Signed-off-by: NRobin Murphy <robin.murphy@arm.com>
      [will: re-jig control flow to avoid duplicate cmpxchg test]
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>
      0c9e76ff
    • J
      iommu/arm-smmu-v3: Fix a couple of minor comment typos · b5c2f25c
      John Garry 提交于
      commit 657135f3108122556c3cf60a78c6f0e76aeb60e6 commit
      
      Fix some comment typos spotted.
      Signed-off-by: NJohn Garry <john.garry@huawei.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>
      b5c2f25c
    • M
      sched/psi: Correct overly pessimistic size calculation · 0801de60
      Miles Chen 提交于
      commit 4adcdcea717cb2d8436bef00dd689aa5bc76f11b upstream.
      
      When passing a equal or more then 32 bytes long string to psi_write(),
      psi_write() copies 31 bytes to its buf and overwrites buf[30]
      with '\0'. Which makes the input string 1 byte shorter than
      it should be.
      
      Fix it by copying sizeof(buf) bytes when nbytes >= sizeof(buf).
      
      This does not cause problems in normal use case like:
      "some 500000 10000000" or "full 500000 10000000" because they
      are less than 32 bytes in length.
      
      	/* assuming nbytes == 35 */
      	char buf[32];
      
      	buf_size = min(nbytes, (sizeof(buf) - 1)); /* buf_size = 31 */
      	if (copy_from_user(buf, user_buf, buf_size))
      		return -EFAULT;
      
      	buf[buf_size - 1] = '\0'; /* buf[30] = '\0' */
      
      Before:
      
       %cd /proc/pressure/
       %echo "123456789|123456789|123456789|1234" > memory
       [   22.473497] nbytes=35,buf_size=31
       [   22.473775] 123456789|123456789|123456789| (print 30 chars)
       %sh: write error: Invalid argument
      
       %echo "123456789|123456789|123456789|1" > memory
       [   64.916162] nbytes=32,buf_size=31
       [   64.916331] 123456789|123456789|123456789| (print 30 chars)
       %sh: write error: Invalid argument
      
      After:
      
       %cd /proc/pressure/
       %echo "123456789|123456789|123456789|1234" > memory
       [  254.837863] nbytes=35,buf_size=32
       [  254.838541] 123456789|123456789|123456789|1 (print 31 chars)
       %sh: write error: Invalid argument
      
       %echo "123456789|123456789|123456789|1" > memory
       [ 9965.714935] nbytes=32,buf_size=32
       [ 9965.715096] 123456789|123456789|123456789|1 (print 31 chars)
       %sh: write error: Invalid argument
      
      Also remove the superfluous parentheses.
      Signed-off-by: NMiles Chen <miles.chen@mediatek.com>
      Cc: <linux-mediatek@lists.infradead.org>
      Cc: <wsd_upstream@mediatek.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190912103452.13281-1-miles.chen@mediatek.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      0801de60
    • S
      sched/psi: Do not require setsched permission from the trigger creator · 86511360
      Suren Baghdasaryan 提交于
      commit 04e048cf09d7b5fc995817cdc5ae1acd4482429c upstream.
      
      When a process creates a new trigger by writing into /proc/pressure/*
      files, permissions to write such a file should be used to determine whether
      the process is allowed to do so or not. Current implementation would also
      require such a process to have setsched capability. Setting of psi trigger
      thread's scheduling policy is an implementation detail and should not be
      exposed to the user level. Remove the permission check by using _nocheck
      version of the function.
      Suggested-by: NNick Kralevich <nnk@google.com>
      Signed-off-by: NSuren Baghdasaryan <surenb@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: lizefan@huawei.com
      Cc: mingo@redhat.com
      Cc: akpm@linux-foundation.org
      Cc: kernel-team@android.com
      Cc: dennisszhou@gmail.com
      Cc: dennis@kernel.org
      Cc: hannes@cmpxchg.org
      Cc: axboe@kernel.dk
      Link: https://lkml.kernel.org/r/20190730013310.162367-1-surenb@google.comSigned-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      86511360
    • P
      sched/psi: Reduce psimon FIFO priority · 97ff1a75
      Peter Zijlstra 提交于
      commit 14f5c7b46a41a595fc61db37f55721714729e59e upstream.
      
      PSI defaults to a FIFO-99 thread, reduce this to FIFO-1.
      
      FIFO-99 is the very highest priority available to SCHED_FIFO and
      it not a suitable default; it would indicate the psi work is the
      most important work on the machine.
      
      Since Real-Time tasks will have pre-allocated memory and locked it in
      place, Real-Time tasks do not care about PSI. All it needs is to be
      above OTHER.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      97ff1a75
    • J
      blk-cgroup: turn on psi memstall stuff · 9a8eb1ab
      Josef Bacik 提交于
      commit fd112c74652371a023f85d87b70bee7169e8f4d0 upstream.
      
      With the psi stuff in place we can use the memstall flag to indicate
      pressure that happens from throttling.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      9a8eb1ab
    • H
      zswap: use movable memory if zpool support allocate movable memory · 0a943f50
      Hui Zhu 提交于
      commit d2fcd82bb83aab47c6d63aa8c960cd5edb578065 upstream
      
      This is the third version that was updated according to the comments from
      Sergey Senozhatsky https://lkml.org/lkml/2019/5/29/73 and Shakeel Butt
      https://lkml.org/lkml/2019/6/4/973
      
      zswap compresses swap pages into a dynamically allocated RAM-based memory
      pool.  The memory pool should be zbud, z3fold or zsmalloc.  All of them
      will allocate unmovable pages.  It will increase the number of unmovable
      page blocks that will bad for anti-fragment.
      
      zsmalloc support page migration if request movable page:
              handle = zs_malloc(zram->mem_pool, comp_len,
                      GFP_NOIO | __GFP_HIGHMEM |
                      __GFP_MOVABLE);
      
      And commit "zpool: Add malloc_support_movable to zpool_driver" add
      zpool_malloc_support_movable check malloc_support_movable to make sure if
      a zpool support allocate movable memory.
      
      This commit let zswap allocate block with gfp
      __GFP_HIGHMEM | __GFP_MOVABLE if zpool support allocate movable memory.
      
      Following part is test log in a pc that has 8G memory and 2G swap.
      
      Without this commit:
      ~# echo lz4 > /sys/module/zswap/parameters/compressor
      ~# echo zsmalloc > /sys/module/zswap/parameters/zpool
      ~# echo 1 > /sys/module/zswap/parameters/enabled
      ~# swapon /swapfile
      ~# cd /home/teawater/kernel/vm-scalability/
      /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
      /home/teawater/kernel/vm-scalability# ./case-anon-w-seq
      2717908992 bytes / 4826062 usecs = 549973 KB/s
      2717908992 bytes / 4864201 usecs = 545661 KB/s
      2717908992 bytes / 4867015 usecs = 545346 KB/s
      2717908992 bytes / 4915485 usecs = 539968 KB/s
      397853 usecs to free memory
      357820 usecs to free memory
      421333 usecs to free memory
      420454 usecs to free memory
      /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
      Page block order: 9
      Pages per block:  512
      
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
      Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      3
      Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type    Unmovable      6      5      8      6      6      5      4      1      1      1      0
      Node    0, zone    DMA32, type      Movable     25     20     20     19     22     15     14     11     11      5    767
      Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type    Unmovable   4753   5588   5159   4613   3712   2520   1448    594    188     11      0
      Node    0, zone   Normal, type      Movable     16      3    457   2648   2143   1435    860    459    223    224    296
      Node    0, zone   Normal, type  Reclaimable      0      0     44     38     11      2      0      0      0      0      0
      Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      
      Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
      Node 0, zone      DMA            1            7            0            0            0            0
      Node 0, zone    DMA32            4         1652            0            0            0            0
      Node 0, zone   Normal          931         1485           15            0            0            0
      
      With this commit:
      ~# echo lz4 > /sys/module/zswap/parameters/compressor
      ~# echo zsmalloc > /sys/module/zswap/parameters/zpool
      ~# echo 1 > /sys/module/zswap/parameters/enabled
      ~# swapon /swapfile
      ~# cd /home/teawater/kernel/vm-scalability/
      /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
      /home/teawater/kernel/vm-scalability# ./case-anon-w-seq
      2717908992 bytes / 4689240 usecs = 566020 KB/s
      2717908992 bytes / 4760605 usecs = 557535 KB/s
      2717908992 bytes / 4803621 usecs = 552543 KB/s
      2717908992 bytes / 5069828 usecs = 523530 KB/s
      431546 usecs to free memory
      383397 usecs to free memory
      456454 usecs to free memory
      224487 usecs to free memory
      /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
      Page block order: 9
      Pages per block:  512
      
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
      Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      3
      Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type    Unmovable     10      8     10      9     10      4      3      2      3      0      0
      Node    0, zone    DMA32, type      Movable     18     12     14     16     16     11      9      5      5      6    775
      Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      1
      Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type    Unmovable   2669   1236    452    118     37     14      4      1      2      3      0
      Node    0, zone   Normal, type      Movable   3850   6086   5274   4327   3510   2494   1520    934    438    220    470
      Node    0, zone   Normal, type  Reclaimable     56     93    155    124     47     31     17      7      3      0      0
      Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      
      Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
      Node 0, zone      DMA            1            7            0            0            0            0
      Node 0, zone    DMA32            4         1650            2            0            0            0
      Node 0, zone   Normal           79         2326           26            0            0            0
      
      You can see that the number of unmovable page blocks is decreased
      when the kernel has this commit.
      
      Link: http://lkml.kernel.org/r/20190605100630.13293-2-teawaterz@linux.alibaba.comSigned-off-by: NHui Zhu <teawaterz@linux.alibaba.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      0a943f50
    • H
      zpool: add malloc_support_movable to zpool_driver · e874b6e5
      Hui Zhu 提交于
      commit c165f25d23ecb2f9f121ced20435415b931219e2 upstream
      
      As a zpool_driver, zsmalloc can allocate movable memory because it support
      migate pages.  But zbud and z3fold cannot allocate movable memory.
      
      Add malloc_support_movable to zpool_driver.  If a zpool_driver support
      allocate movable memory, set it to true.  And add
      zpool_malloc_support_movable check malloc_support_movable to make sure if
      a zpool support allocate movable memory.
      
      Link: http://lkml.kernel.org/r/20190605100630.13293-1-teawaterz@linux.alibaba.comSigned-off-by: NHui Zhu <teawaterz@linux.alibaba.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      e874b6e5
    • G
      ip_sockglue: Fix missing-check bug in ip_ra_control() · c6a8ac6f
      Gen Zhang 提交于
      commit 425aa0e1d01513437668fa3d4a971168bbaa8515 upstream.
      
      In function ip_ra_control(), the pointer new_ra is allocated a memory
      space via kmalloc(). And it is used in the following codes. However,
      when  there is a memory allocation error, kmalloc() fails. Thus null
      pointer dereference may happen. And it will cause the kernel to crash.
      Therefore, we should check the return value and handle the error.
      Signed-off-by: NGen Zhang <blackgod016574@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      c6a8ac6f
    • G
      efi/x86/Add missing error handling to old_memmap 1:1 mapping code · b47d082f
      Gen Zhang 提交于
      commit 4e78921ba4dd0aca1cc89168f45039add4183f8e upstream.
      
      The old_memmap flow in efi_call_phys_prolog() performs numerous memory
      allocations, and either does not check for failure at all, or it does
      but fails to propagate it back to the caller, which may end up calling
      into the firmware with an incomplete 1:1 mapping.
      
      So let's fix this by returning NULL from efi_call_phys_prolog() on
      memory allocation failures only, and by handling this condition in the
      caller. Also, clean up any half baked sets of page tables that we may
      have created before returning with a NULL return value.
      
      Note that any failure at this level will trigger a panic() two levels
      up, so none of this makes a huge difference, but it is a nice cleanup
      nonetheless.
      
      [ardb: update commit log, add efi_call_phys_epilog() call on error path]
      Signed-off-by: NGen Zhang <blackgod016574@gmail.com>
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rob Bradford <robert.bradford@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/20190525112559.7917-2-ard.biesheuvel@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      b47d082f
    • G
      ipv6_sockglue: Fix a missing-check bug in ip6_ra_control() · 1f3d34ec
      Gen Zhang 提交于
      commit 95baa60a0da80a0143e3ddd4d3725758b4513825 upstream.
      
      In function ip6_ra_control(), the pointer new_ra is allocated a memory
      space via kmalloc(). And it is used in the following codes. However,
      when there is a memory allocation error, kmalloc() fails. Thus null
      pointer dereference may happen. And it will cause the kernel to crash.
      Therefore, we should check the return value and handle the error.
      Signed-off-by: NGen Zhang <blackgod016574@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      1f3d34ec
    • G
      scsi: mpt3sas_ctl: fix double-fetch bug in _ctl_ioctl_main() · b9947abb
      Gen Zhang 提交于
      commit f9e3ebeea4521652318af903cddeaf033527e93e upstream.
      
      In _ctl_ioctl_main(), 'ioctl_header' is fetched the first time from
      userspace. 'ioctl_header.ioc_number' is then checked. The legal result is
      saved to 'ioc'. Then, in condition MPT3COMMAND, the whole struct is fetched
      again from the userspace. Then _ctl_do_mpt_command() is called, 'ioc' and
      'karg' as inputs.
      
      However, a malicious user can change the 'ioc_number' between the two
      fetches, which will cause a potential security issues.  Moreover, a
      malicious user can provide a valid 'ioc_number' to pass the check in first
      fetch, and then modify it in the second fetch.
      
      To fix this, we need to recheck the 'ioc_number' in the second fetch.
      Signed-off-by: NGen Zhang <blackgod016574@gmail.com>
      Acked-by: NSuganath Prabu S <suganath-prabu.subramani@broadcom.com>
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      b9947abb
    • G
      clk-sunxi: fix a missing-check bug in sunxi_divs_clk_setup() · 6731e4fa
      Gen Zhang 提交于
      commit fcdf445ff42f036d22178b49cf64e92d527c1330 upstream.
      
      In sunxi_divs_clk_setup(), 'derived_name' is allocated by kstrndup().
      It returns NULL when fails. 'derived_name' should be checked.
      Signed-off-by: NGen Zhang <blackgod016574@gmail.com>
      Signed-off-by: NMaxime Ripard <maxime.ripard@bootlin.com>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      6731e4fa
    • G
      powerpc/pseries/dlpar: Fix a missing check in dlpar_parse_cc_property() · 8625d444
      Gen Zhang 提交于
      commit efa9ace68e487ddd29c2b4d6dd23242158f1f607 upstream.
      
      In dlpar_parse_cc_property(), 'prop->name' is allocated by kstrdup().
      kstrdup() may return NULL, so it should be checked and handle error.
      And prop should be freed if 'prop->name' is NULL.
      Signed-off-by: NGen Zhang <blackgod016574@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      8625d444
    • M
    • A
      reduce e1000e boot time by tightening sleep ranges · 14898532
      Arjan van de Ven 提交于
      commit ab6973aed6200510662856afce5e3d1e386b7b64 upstream.
      
      The e1000e driver is a great user of the usleep_range() API,
      and has any nice ranges that in principle help power management.
      
      However the ranges that are used only during system startup are
      very long (and can add easily 100 msec to the boot time) while
      the power savings of such long ranges is irrelevant due to the
      one-off, boot only, nature of these functions.
      
      This patch shrinks some of the longest ranges to be shorter
      (while still using a power friendly 1 msec range); this saves
      100msec+ of boot time on my BDW NUCs
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NPaul Menzel <pmenzel@molgen.mpg.de>
      Tested-by: NAaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      14898532
    • J
      psi: get poll_work to run when calling poll syscall next time · 0b57115b
      Jason Xing 提交于
      Only when calling the poll syscall the first time can user
      receive POLLPRI correctly. After that, user always fails to
      acquire the event signal.
      
      Reproduce case:
      1. Get the monitor code in Documentation/accounting/psi.txt
      2. Run it, and wait for the event triggered.
      3. Kill and restart the process.
      
      The question is why we can end up with poll_scheduled = 1 but the work
      not running (which would reset it to 0). And the answer is because the
      scheduling side sees group->poll_kworker under RCU protection and then
      schedules it, but here we cancel the work and destroy the worker. The
      cancel needs to pair with resetting the poll_scheduled flag.
      Signed-off-by: NJason Xing <kerneljasonxing@linux.alibaba.com>
      Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>
      Reviewed-by: NSuren Baghdasaryan <surenb@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      0b57115b
    • E
      ext4: fix bigalloc cluster freeing when hole punching under load · 8fac7eda
      Eric Whitney 提交于
      commit 7bd75230b43727b258a4f7a59d62114cffe1b6c8 upstream.
      
      Ext4 may not free clusters correctly when punching holes in bigalloc
      file systems under high load conditions.  If it's not possible to
      extend and restart the journal in ext4_ext_rm_leaf() when preparing to
      remove blocks from a punched region, a retry of the entire punch
      operation is triggered in ext4_ext_remove_space().  This causes a
      partial cluster to be set to the first cluster in the extent found to
      the right of the punched region.  However, if the punch operation
      prior to the retry had made enough progress to delete one or more
      extents and a partial cluster candidate for freeing had already been
      recorded, the retry would overwrite the partial cluster.  The loss of
      this information makes it impossible to correctly free the original
      partial cluster in all cases.
      
      This bug can cause generic/476 to fail when run as part of
      xfstests-bld's bigalloc and bigalloc_1k test cases.  The failure is
      reported when e2fsck detects bad iblocks counts greater than expected
      in units of whole clusters and also detects a number of negative block
      bitmap differences equal to the iblocks discrepancy in cluster units.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      8fac7eda
    • G
      ext4: fix build error when DX_DEBUG is defined · f443929b
      Gabriel Krisman Bertazi 提交于
      commit 799578ab16e86b074c184ec5abbda0bc698c7b0b upstream.
      
      Enabling DX_DEBUG triggers the build error below.  info is an attribute
      of  the dxroot structure.
      
      linux/fs/ext4/namei.c:2264:12: error: ‘info’
      undeclared (first use in this function); did you mean ‘insl’?
      	   	  info->indirect_levels));
      
      Fixes: e08ac99f ("ext4: add largedir feature")
      Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.co.uk>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      f443929b
    • D
      mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock · a432d982
      Dave Chinner 提交于
      commit 64081362e8ff4587b4554087f3cfc73d3e0a4cd7 upstream.
      
      We've recently seen a workload on XFS filesystems with a repeatable
      deadlock between background writeback and a multi-process application
      doing concurrent writes and fsyncs to a small range of a file.
      
      range_cyclic
      writeback		Process 1		Process 2
      
      xfs_vm_writepages
        write_cache_pages
          writeback_index = 2
          cycled = 0
          ....
          find page 2 dirty
          lock Page 2
          ->writepage
            page 2 writeback
            page 2 clean
            page 2 added to bio
          no more pages
      			write()
      			locks page 1
      			dirties page 1
      			locks page 2
      			dirties page 1
      			fsync()
      			....
      			xfs_vm_writepages
      			write_cache_pages
      			  start index 0
      			  find page 1 towrite
      			  lock Page 1
      			  ->writepage
      			    page 1 writeback
      			    page 1 clean
      			    page 1 added to bio
      			  find page 2 towrite
      			  lock Page 2
      			  page 2 is writeback
      			  <blocks>
      						write()
      						locks page 1
      						dirties page 1
      						fsync()
      						....
      						xfs_vm_writepages
      						write_cache_pages
      						  start index 0
      
          !done && !cycled
            sets index to 0, restarts lookup
          find page 1 dirty
      						  find page 1 towrite
      						  lock Page 1
      						  page 1 is writeback
      						  <blocks>
      
          lock Page 1
          <blocks>
      
      DEADLOCK because:
      
      	- process 1 needs page 2 writeback to complete to make
      	  enough progress to issue IO pending for page 1
      	- writeback needs page 1 writeback to complete so process 2
      	  can progress and unlock the page it is blocked on, then it
      	  can issue the IO pending for page 2
      	- process 2 can't make progress until process 1 issues IO
      	  for page 1
      
      The underlying cause of the problem here is that range_cyclic writeback is
      processing pages in descending index order as we hold higher index pages
      in a structure controlled from above write_cache_pages().  The
      write_cache_pages() caller needs to be able to submit these pages for IO
      before write_cache_pages restarts writeback at mapping index 0 to avoid
      wcp inverting the page lock/writeback wait order.
      
      generic_writepages() is not susceptible to this bug as it has no private
      context held across write_cache_pages() - filesystems using this
      infrastructure always submit pages in ->writepage immediately and so there
      is no problem with range_cyclic going back to mapping index 0.
      
      However:
      	mpage_writepages() has a private bio context,
      	exofs_writepages() has page_collect
      	fuse_writepages() has fuse_fill_wb_data
      	nfs_writepages() has nfs_pageio_descriptor
      	xfs_vm_writepages() has xfs_writepage_ctx
      
      All of these ->writepages implementations can hold pages under writeback
      in their private structures until write_cache_pages() returns, and hence
      they are all susceptible to this deadlock.
      
      Also worth noting is that ext4 has it's own bastardised version of
      write_cache_pages() and so it /may/ have an equivalent deadlock.  I looked
      at the code long enough to understand that it has a similar retry loop for
      range_cyclic writeback reaching the end of the file and then promptly ran
      away before my eyes bled too much.  I'll leave it for the ext4 developers
      to determine if their code is actually has this deadlock and how to fix it
      if it has.
      
      There's a few ways I can see avoid this deadlock.  There's probably more,
      but these are the first I've though of:
      
      1. get rid of range_cyclic altogether
      
      2. range_cyclic always stops at EOF, and we start again from
      writeback index 0 on the next call into write_cache_pages()
      
      2a. wcp also returns EAGAIN to ->writepages implementations to
      indicate range cyclic has hit EOF. writepages implementations can
      then flush the current context and call wpc again to continue. i.e.
      lift the retry into the ->writepages implementation
      
      3. range_cyclic uses trylock_page() rather than lock_page(), and it
      skips pages it can't lock without blocking. It will already do this
      for pages under writeback, so this seems like a no-brainer
      
      3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid
      blocking as per pages under writeback.
      
      I don't think #1 is an option - range_cyclic prevents frequently
      dirtied lower file offset from starving background writeback of
      rarely touched higher file offsets.
      
      performance as going back to the start of the file implies an
      immediate seek. We'll have exactly the same number of seeks if we
      switch writeback to another inode, and then come back to this one
      later and restart from index 0.
      
      retry loop up into the wcp caller means we can issue IO on the
      pending pages before calling wcp again, and so avoid locking or
      waiting on pages in the wrong order. I'm not convinced we need to do
      this given that we get the same thing from #2 on the next writeback
      call from the writeback infrastructure.
      
      inversion problem, just prevents it from becoming a deadlock
      situation. I'd prefer we fix the inversion, not sweep it under the
      carpet like this.
      
      band-aid fix of #3.
      
      So it seems that the simplest way to fix this issue is to implement
      solution #2
      
      Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.comSigned-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.de>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      a432d982
    • M
      block: fix single range discard merge · d6b88955
      Ming Lei 提交于
      commit 2a5cf35cd6c56b2924bce103413ad3381bdc31fa upstream.
      
      There are actually two kinds of discard merge:
      
      - one is the normal discard merge, just like normal read/write request,
      and call it single-range discard
      
      - another is the multi-range discard, queue_max_discard_segments(rq->q) > 1
      
      For the former case, queue_max_discard_segments(rq->q) is 1, and we
      should handle this kind of discard merge like the normal read/write
      request.
      
      This patch fixes the following kernel panic issue[1], which is caused by
      not removing the single-range discard request from elevator queue.
      
      Guangwu has one raid discard test case, in which this issue is a bit
      easier to trigger, and I verified that this patch can fix the kernel
      panic issue in Guangwu's test case.
      
      [1] kernel panic log from Jens's report
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000148
       PGD 0 P4D 0.
       Oops: 0000 [#1] SMP PTI
       CPU: 37 PID: 763 Comm: kworker/37:1H Not tainted \
      4.20.0-rc3-00649-ge64d9a554a91-dirty #14  Hardware name: Wiwynn \
      Leopard-Orv2/Leopard-DDR BW, BIOS LBM08   03/03/2017       Workqueue: kblockd \
      blk_mq_run_work_fn                                            RIP: \
      0010:blk_mq_get_driver_tag+0x81/0x120                                       Code: 24 \
      10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 \
      0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 \
      f6 87 b0 00 00 00 02  RSP: 0018:ffffc90004aabd30 EFLAGS: 00010246                     \
        RAX: 0000000000000003 RBX: ffff888465ea1300 RCX: ffffc90004aabde8
       RDX: 00000000ffffffff RSI: ffffc90004aabde8 RDI: 0000000000000000
       RBP: 0000000000000000 R08: ffff888465ea1348 R09: 0000000000000000
       R10: 0000000000001000 R11: 00000000ffffffff R12: ffff888465ea1300
       R13: 0000000000000000 R14: ffff888465ea1348 R15: ffff888465d10000
       FS:  0000000000000000(0000) GS:ffff88846f9c0000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000148 CR3: 000000000220a003 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        blk_mq_dispatch_rq_list+0xec/0x480
        ? elv_rb_del+0x11/0x30
        blk_mq_do_dispatch_sched+0x6e/0xf0
        blk_mq_sched_dispatch_requests+0xfa/0x170
        __blk_mq_run_hw_queue+0x5f/0xe0
        process_one_work+0x154/0x350
        worker_thread+0x46/0x3c0
        kthread+0xf5/0x130
        ? process_one_work+0x350/0x350
        ? kthread_destroy_worker+0x50/0x50
        ret_from_fork+0x1f/0x30
       Modules linked in: sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel \
      kvm switchtec irqbypass iTCO_wdt iTCO_vendor_support efivars cdc_ether usbnet mii \
      cdc_acm i2c_i801 lpc_ich mfd_core ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq \
      button sch_fq_codel nfsd nfs_acl lockd grace auth_rpcgss oid_registry sunrpc nvme \
      nvme_core fuse sg loop efivarfs autofs4  CR2: 0000000000000148                        \
      
       ---[ end trace 340a1fb996df1b9b ]---
       RIP: 0010:blk_mq_get_driver_tag+0x81/0x120
       Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 \
      00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 \
      20 72 37 f6 87 b0 00 00 00 02
      
      Fixes: 445251d0 ("blk-mq: fix discard merge with scheduler attached")
      Reported-by: NJens Axboe <axboe@kernel.dk>
      Cc: Guangwu Zhang <guazhang@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      d6b88955
    • O
      NFSv4.x: fix lock recovery during delegation recall · 23cb2e27
      Olga Kornievskaia 提交于
      commit 44f411c353bf6d98d5a34f8f1b8605d43b2e50b8 upstream.
      
      Running "./nfstest_delegation --runtest recall26" uncovers that
      client doesn't recover the lock when we have an appending open,
      where the initial open got a write delegation.
      
      Instead of checking for the passed in open context against
      the file lock's open context. Check that the state is the same.
      Signed-off-by: NOlga Kornievskaia <kolga@netapp.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      23cb2e27
    • J
      block: fix the DISCARD request merge · 4ff7fa23
      Jianchao Wang 提交于
      commit 69840466086d2248898020a08dda52732686c4e6 upstream.
      
      There are two cases when handle DISCARD merge.
      If max_discard_segments == 1, the bios/requests need to be contiguous
      to merge. If max_discard_segments > 1, it takes every bio as a range
      and different range needn't to be contiguous.
      
      But now, attempt_merge screws this up. It always consider contiguity
      for DISCARD for the case max_discard_segments > 1 and cannot merge
      contiguous DISCARD for the case max_discard_segments == 1, because
      rq_attempt_discard_merge always returns false in this case.
      This patch fixes both of the two cases above.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      4ff7fa23
    • H
      dm raid: fix false -EBUSY when handling check/repair message · 8456775a
      Heinz Mauelshagen 提交于
      commit 74694bcbdf7e28a5ad548cdda9ac56d30be00d13 upstream.
      
      Sending a check/repair message infrequently leads to -EBUSY instead of
      properly identifying an active resync.  This occurs because
      raid_message() is testing recovery bits in a racy way.
      
      Fix by calling decipher_sync_action() from raid_message() to properly
      identify the idle state of the RAID device.
      Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      8456775a
    • D
      xfs: fix use-after-free race in xfs_buf_rele · aaef4a2d
      Dave Chinner 提交于
      commit 37fd1678245f7a5898c1b05128bc481fb403c290 upstream.
      
      When looking at a 4.18 based KASAN use after free report, I noticed
      that racing xfs_buf_rele() may race on dropping the last reference
      to the buffer and taking the buffer lock. This was the symptom
      displayed by the KASAN report, but the actual issue that was
      reported had already been fixed in 4.19-rc1 by commit e339dd8d
      ("xfs: use sync buffer I/O for sync delwri queue submission").
      
      Despite this, I think there is still an issue with xfs_buf_rele()
      in this code:
      
              release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
              spin_lock(&bp->b_lock);
              if (!release) {
      .....
      
      If two threads race on the b_lock after both dropping a reference
      and one getting dropping the last reference so release = true, we
      end up with:
      
      CPU 0				CPU 1
      atomic_dec_and_lock()
      				atomic_dec_and_lock()
      				spin_lock(&bp->b_lock)
      spin_lock(&bp->b_lock)
      <spins>
      				<release = true bp->b_lru_ref = 0>
      				<remove from lists>
      				freebuf = true
      				spin_unlock(&bp->b_lock)
      				xfs_buf_free(bp)
      <gets lock, reading and writing freed memory>
      <accesses freed memory>
      spin_unlock(&bp->b_lock) <reads/writes freed memory>
      
      IOWs, we can't safely take bp->b_lock after dropping the hold
      reference because the buffer may go away at any time after we
      drop that reference. However, this can be fixed simply by taking the
      bp->b_lock before we drop the reference.
      
      It is safe to nest the pag_buf_lock inside bp->b_lock as the
      pag_buf_lock is only used to serialise against lookup in
      xfs_buf_find() and no other locks are held over or under the
      pag_buf_lock there. Make this clear by documenting the buffer lock
      orders at the top of the file.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      aaef4a2d
    • W
      x86: uaccess: Inhibit speculation past access_ok() in user_access_begin() · 545a8448
      Will Deacon 提交于
      commit 6e693b3ffecb0b478c7050b44a4842854154f715 upstream.
      
      Commit 594cc251fdd0 ("make 'user_access_begin()' do 'access_ok()'")
      makes the access_ok() check part of the user_access_begin() preceding a
      series of 'unsafe' accesses.  This has the desirable effect of ensuring
      that all 'unsafe' accesses have been range-checked, without having to
      pick through all of the callsites to verify whether the appropriate
      checking has been made.
      
      However, the consolidated range check does not inhibit speculation, so
      it is still up to the caller to ensure that they are not susceptible to
      any speculative side-channel attacks for user addresses that ultimately
      fail the access_ok() check.
      
      This is an oversight, so use __uaccess_begin_nospec() to ensure that
      speculation is inhibited until the access_ok() check has passed.
      Reported-by: NJulien Thierry <julien.thierry@arm.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      545a8448
    • L
      make 'user_access_begin()' do 'access_ok()' · 747f8a41
      Linus Torvalds 提交于
      commit 594cc251fdd0d231d342d88b2fdff4bc42fb0690 upstream.
      
      Originally, the rule used to be that you'd have to do access_ok()
      separately, and then user_access_begin() before actually doing the
      direct (optimized) user access.
      
      But experience has shown that people then decide not to do access_ok()
      at all, and instead rely on it being implied by other operations or
      similar.  Which makes it very hard to verify that the access has
      actually been range-checked.
      
      If you use the unsafe direct user accesses, hardware features (either
      SMAP - Supervisor Mode Access Protection - on x86, or PAN - Privileged
      Access Never - on ARM) do force you to use user_access_begin().  But
      nothing really forces the range check.
      
      By putting the range check into user_access_begin(), we actually force
      people to do the right thing (tm), and the range check vill be visible
      near the actual accesses.  We have way too long a history of people
      trying to avoid them.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      
      [ Shile: fix following conflicts by adding a dummy arguments ]
      Conflicts:
      	kernel/compat.c
      	kernel/exit.c
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      747f8a41
    • L
      i915: fix missing user_access_end() in page fault exception case · a5de48e5
      Linus Torvalds 提交于
      commit 0b2c8f8b6b0c7530e2866c95862546d0da2057b0 upstream.
      
      When commit fddcd00a49e9 ("drm/i915: Force the slow path after a
      user-write error") unified the error handling for various user access
      problems, it didn't do the user_access_end() that is needed for the
      unsafe_put_user() case.
      
      It's not a huge deal: a missed user_access_end() will only mean that
      SMAP protection isn't active afterwards, and for the error case we'll be
      returning to user mode soon enough anyway.  But it's wrong, and adding
      the proper user_access_end() is trivial enough (and doing it for the
      other error cases where it isn't needed doesn't hurt).
      
      I noticed it while doing the same prep-work for changing
      user_access_begin() that precipitated the access_ok() changes in commit
      96d4f267e40f ("Remove 'type' argument from access_ok() function").
      
      Fixes: fddcd00a49e9 ("drm/i915: Force the slow path after a user-write error")
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: stable@kernel.org # v4.20
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      a5de48e5
    • C
      drm/i915: Force the slow path after a user-write error · a11346be
      Chris Wilson 提交于
      commit fddcd00a49e9122a3579247151e9cb3ce5a1a36e upstream.
      
      If we fail to write the user relocation back when it is changed, force
      ourselves to take the slow relocation path where we can handle faults in
      the write path. There is still an element of dubiousness as having
      patched up the batch to use the correct offset, it no longer matches the
      presumed_offset in the relocation, so a second pass may miss any changes
      in layout.
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20180903083337.13134-3-chris@chris-wilson.co.ukSigned-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      a11346be
    • A
      userfaultfd: allow get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) to trigger userfaults · 6fc1da0b
      Andrea Arcangeli 提交于
      commit 3b9aadf7278d16d7bed4d5d808501065f70898d8 upstream.
      
      get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) called a get_user_pages that would
      not be waiting for userfaults before failing and it would hit on a SIGBUS
      instead.  Using get_user_pages_locked/unlocked instead will allow
      get_mempolicy to allow userfaults to resolve the fault and fill the hole,
      before grabbing the node id of the page.
      
      If the user calls get_mempolicy() with MPOL_F_ADDR | MPOL_F_NODE for an
      address inside an area managed by uffd and there is no page at that
      address, the page allocation from within get_mempolicy() will fail
      because get_user_pages() does not allow for page fault retry required
      for uffd; the user will get SIGBUS.
      
      With this patch, the page fault will be resolved by the uffd and the
      get_mempolicy() will continue normally.
      
      Background:
      
      Via code review, previously the syscall would have returned -EFAULT
      (vm_fault_to_errno), now it will block and wait for an userfault (if
      it's waken before the fault is resolved it'll still -EFAULT).
      
      This way get_mempolicy will give a chance to an "unaware" app to be
      compliant with userfaults.
      
      The reason this visible change is that becoming "userfault compliant"
      cannot regress anything: all other syscalls including read(2)/write(2)
      had to become "userfault compliant" long time ago (that's one of the
      things userfaultfd can do that PROT_NONE and trapping segfaults can't).
      
      So this is just one more syscall that become "userfault compliant" like
      all other major ones already were.
      
      This has been happening on virtio-bridge dpdk process which just called
      get_mempolicy on the guest space post live migration, but before the
      memory had a chance to be migrated to destination.
      
      I didn't run an strace to be able to show the -EFAULT going away, but
      I've the confirmation of the below debug aid information (only visible
      with CONFIG_DEBUG_VM=y) going away with the patch:
      
          [20116.371461] FAULT_FLAG_ALLOW_RETRY missing 0
          [20116.371464] CPU: 1 PID: 13381 Comm: vhost-events Not tainted 4.17.12-200.fc28.x86_64 #1
          [20116.371465] Hardware name: LENOVO 20FAS2BN0A/20FAS2BN0A, BIOS N1CET54W (1.22 ) 02/10/2017
          [20116.371466] Call Trace:
          [20116.371473]  dump_stack+0x5c/0x80
          [20116.371476]  handle_userfault.cold.37+0x1b/0x22
          [20116.371479]  ? remove_wait_queue+0x20/0x60
          [20116.371481]  ? poll_freewait+0x45/0xa0
          [20116.371483]  ? do_sys_poll+0x31c/0x520
          [20116.371485]  ? radix_tree_lookup_slot+0x1e/0x50
          [20116.371488]  shmem_getpage_gfp+0xce7/0xe50
          [20116.371491]  ? page_add_file_rmap+0x1a/0x2c0
          [20116.371493]  shmem_fault+0x78/0x1e0
          [20116.371495]  ? filemap_map_pages+0x3a1/0x450
          [20116.371498]  __do_fault+0x1f/0xc0
          [20116.371500]  __handle_mm_fault+0xe2e/0x12f0
          [20116.371502]  handle_mm_fault+0xda/0x200
          [20116.371504]  __get_user_pages+0x238/0x790
          [20116.371506]  get_user_pages+0x3e/0x50
          [20116.371510]  kernel_get_mempolicy+0x40b/0x700
          [20116.371512]  ? vfs_write+0x170/0x1a0
          [20116.371515]  __x64_sys_get_mempolicy+0x21/0x30
          [20116.371517]  do_syscall_64+0x5b/0x160
          [20116.371520]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The above harmless debug message (not a kernel crash, just a
      dump_stack()) is shown with CONFIG_DEBUG_VM=y to more quickly identify
      and improve kernel spots that may have to become "userfaultfd
      compliant" like this one (without having to run an strace and search
      for syscall misbehavior).  Spots like the above are more closer to a
      kernel bug for the non-cooperative usages that Mike focuses on, than
      for for dpdk qemu-cooperative usages that reproduced it, but it's still
      nicer to get this fixed for dpdk too.
      
      The part of the patch that caused me to think is only the
      implementation issue of mpol_get, but it looks like it should work safe
      no matter the kind of mempolicy structure that is (the default static
      policy also starts at 1 so it'll go to 2 and back to 1 without crashing
      everything at 0).
      
      [rppt@linux.vnet.ibm.com: changelog addition]
        http://lkml.kernel.org/r/20180904073718.GA26916@rapoport-lnx
      Link: http://lkml.kernel.org/r/20180831214848.23676-1-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NMaxime Coquelin <maxime.coquelin@redhat.com>
      Tested-by: NDr. David Alan Gilbert <dgilbert@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      6fc1da0b
    • X
      random: speed up the initialization of module · 94eb2632
      Xingjun Liu 提交于
      During the module initialization phase, entropy will be added
      to entropy pool for every interrupt, the change should speed up
      initialization of the random module.
      
      Before optimization:
      [   22.180236] random: crng init done
      
      After optimization:
      [    1.474832] random: crng init done
      Signed-off-by: NXingjun Liu <xingjun.lxj@alibaba-inc.com>
      Reviewed-by: NLiu Jiang <gerry@linux.alibaba.com>
      Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>
      Reviewed-by: jia zhang's avatarJia Zhang <zhang.jia@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
      94eb2632
    • X
      random: introduce the initialization seed · 0d269934
      Xingjun Liu 提交于
      Add random entropy with the module parameter as the initialization
      seed when the kernel startup.
      
      For guest OS working in VM, the random entropy will be less,
      it cause the random module to initialize very slowly, and if
      the application which running in guest os gets a certain amount of
      random numbers in the initialization phase, it will be blocked.
      
      This patch allows the VMM to provide a certain amount of random seed
      when starting guest OS, speeding up the initialization of the entire
      guest OS random module.
      
      Before optimization:
      [   22.180236] random: crng init done
      
      After optimization:
      [    1.553362] random: crng init done
      Signed-off-by: NXingjun Liu <xingjun.lxj@alibaba-inc.com>
      Reviewed-by: NLiu Jiang <gerry@linux.alibaba.com>
      Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>
      Reviewed-by: jia zhang's avatarJia Zhang <zhang.jia@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
      0d269934