1. 30 10月, 2019 40 次提交
    • S
      sched/psi: Do not require setsched permission from the trigger creator · 86511360
      Suren Baghdasaryan 提交于
      commit 04e048cf09d7b5fc995817cdc5ae1acd4482429c upstream.
      
      When a process creates a new trigger by writing into /proc/pressure/*
      files, permissions to write such a file should be used to determine whether
      the process is allowed to do so or not. Current implementation would also
      require such a process to have setsched capability. Setting of psi trigger
      thread's scheduling policy is an implementation detail and should not be
      exposed to the user level. Remove the permission check by using _nocheck
      version of the function.
      Suggested-by: NNick Kralevich <nnk@google.com>
      Signed-off-by: NSuren Baghdasaryan <surenb@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: lizefan@huawei.com
      Cc: mingo@redhat.com
      Cc: akpm@linux-foundation.org
      Cc: kernel-team@android.com
      Cc: dennisszhou@gmail.com
      Cc: dennis@kernel.org
      Cc: hannes@cmpxchg.org
      Cc: axboe@kernel.dk
      Link: https://lkml.kernel.org/r/20190730013310.162367-1-surenb@google.comSigned-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      86511360
    • P
      sched/psi: Reduce psimon FIFO priority · 97ff1a75
      Peter Zijlstra 提交于
      commit 14f5c7b46a41a595fc61db37f55721714729e59e upstream.
      
      PSI defaults to a FIFO-99 thread, reduce this to FIFO-1.
      
      FIFO-99 is the very highest priority available to SCHED_FIFO and
      it not a suitable default; it would indicate the psi work is the
      most important work on the machine.
      
      Since Real-Time tasks will have pre-allocated memory and locked it in
      place, Real-Time tasks do not care about PSI. All it needs is to be
      above OTHER.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      97ff1a75
    • J
      blk-cgroup: turn on psi memstall stuff · 9a8eb1ab
      Josef Bacik 提交于
      commit fd112c74652371a023f85d87b70bee7169e8f4d0 upstream.
      
      With the psi stuff in place we can use the memstall flag to indicate
      pressure that happens from throttling.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      9a8eb1ab
    • H
      zswap: use movable memory if zpool support allocate movable memory · 0a943f50
      Hui Zhu 提交于
      commit d2fcd82bb83aab47c6d63aa8c960cd5edb578065 upstream
      
      This is the third version that was updated according to the comments from
      Sergey Senozhatsky https://lkml.org/lkml/2019/5/29/73 and Shakeel Butt
      https://lkml.org/lkml/2019/6/4/973
      
      zswap compresses swap pages into a dynamically allocated RAM-based memory
      pool.  The memory pool should be zbud, z3fold or zsmalloc.  All of them
      will allocate unmovable pages.  It will increase the number of unmovable
      page blocks that will bad for anti-fragment.
      
      zsmalloc support page migration if request movable page:
              handle = zs_malloc(zram->mem_pool, comp_len,
                      GFP_NOIO | __GFP_HIGHMEM |
                      __GFP_MOVABLE);
      
      And commit "zpool: Add malloc_support_movable to zpool_driver" add
      zpool_malloc_support_movable check malloc_support_movable to make sure if
      a zpool support allocate movable memory.
      
      This commit let zswap allocate block with gfp
      __GFP_HIGHMEM | __GFP_MOVABLE if zpool support allocate movable memory.
      
      Following part is test log in a pc that has 8G memory and 2G swap.
      
      Without this commit:
      ~# echo lz4 > /sys/module/zswap/parameters/compressor
      ~# echo zsmalloc > /sys/module/zswap/parameters/zpool
      ~# echo 1 > /sys/module/zswap/parameters/enabled
      ~# swapon /swapfile
      ~# cd /home/teawater/kernel/vm-scalability/
      /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
      /home/teawater/kernel/vm-scalability# ./case-anon-w-seq
      2717908992 bytes / 4826062 usecs = 549973 KB/s
      2717908992 bytes / 4864201 usecs = 545661 KB/s
      2717908992 bytes / 4867015 usecs = 545346 KB/s
      2717908992 bytes / 4915485 usecs = 539968 KB/s
      397853 usecs to free memory
      357820 usecs to free memory
      421333 usecs to free memory
      420454 usecs to free memory
      /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
      Page block order: 9
      Pages per block:  512
      
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
      Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      3
      Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type    Unmovable      6      5      8      6      6      5      4      1      1      1      0
      Node    0, zone    DMA32, type      Movable     25     20     20     19     22     15     14     11     11      5    767
      Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type    Unmovable   4753   5588   5159   4613   3712   2520   1448    594    188     11      0
      Node    0, zone   Normal, type      Movable     16      3    457   2648   2143   1435    860    459    223    224    296
      Node    0, zone   Normal, type  Reclaimable      0      0     44     38     11      2      0      0      0      0      0
      Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      
      Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
      Node 0, zone      DMA            1            7            0            0            0            0
      Node 0, zone    DMA32            4         1652            0            0            0            0
      Node 0, zone   Normal          931         1485           15            0            0            0
      
      With this commit:
      ~# echo lz4 > /sys/module/zswap/parameters/compressor
      ~# echo zsmalloc > /sys/module/zswap/parameters/zpool
      ~# echo 1 > /sys/module/zswap/parameters/enabled
      ~# swapon /swapfile
      ~# cd /home/teawater/kernel/vm-scalability/
      /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
      /home/teawater/kernel/vm-scalability# ./case-anon-w-seq
      2717908992 bytes / 4689240 usecs = 566020 KB/s
      2717908992 bytes / 4760605 usecs = 557535 KB/s
      2717908992 bytes / 4803621 usecs = 552543 KB/s
      2717908992 bytes / 5069828 usecs = 523530 KB/s
      431546 usecs to free memory
      383397 usecs to free memory
      456454 usecs to free memory
      224487 usecs to free memory
      /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
      Page block order: 9
      Pages per block:  512
      
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
      Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      3
      Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type    Unmovable     10      8     10      9     10      4      3      2      3      0      0
      Node    0, zone    DMA32, type      Movable     18     12     14     16     16     11      9      5      5      6    775
      Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      1
      Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type    Unmovable   2669   1236    452    118     37     14      4      1      2      3      0
      Node    0, zone   Normal, type      Movable   3850   6086   5274   4327   3510   2494   1520    934    438    220    470
      Node    0, zone   Normal, type  Reclaimable     56     93    155    124     47     31     17      7      3      0      0
      Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      
      Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
      Node 0, zone      DMA            1            7            0            0            0            0
      Node 0, zone    DMA32            4         1650            2            0            0            0
      Node 0, zone   Normal           79         2326           26            0            0            0
      
      You can see that the number of unmovable page blocks is decreased
      when the kernel has this commit.
      
      Link: http://lkml.kernel.org/r/20190605100630.13293-2-teawaterz@linux.alibaba.comSigned-off-by: NHui Zhu <teawaterz@linux.alibaba.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      0a943f50
    • H
      zpool: add malloc_support_movable to zpool_driver · e874b6e5
      Hui Zhu 提交于
      commit c165f25d23ecb2f9f121ced20435415b931219e2 upstream
      
      As a zpool_driver, zsmalloc can allocate movable memory because it support
      migate pages.  But zbud and z3fold cannot allocate movable memory.
      
      Add malloc_support_movable to zpool_driver.  If a zpool_driver support
      allocate movable memory, set it to true.  And add
      zpool_malloc_support_movable check malloc_support_movable to make sure if
      a zpool support allocate movable memory.
      
      Link: http://lkml.kernel.org/r/20190605100630.13293-1-teawaterz@linux.alibaba.comSigned-off-by: NHui Zhu <teawaterz@linux.alibaba.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      e874b6e5
    • G
      ip_sockglue: Fix missing-check bug in ip_ra_control() · c6a8ac6f
      Gen Zhang 提交于
      commit 425aa0e1d01513437668fa3d4a971168bbaa8515 upstream.
      
      In function ip_ra_control(), the pointer new_ra is allocated a memory
      space via kmalloc(). And it is used in the following codes. However,
      when  there is a memory allocation error, kmalloc() fails. Thus null
      pointer dereference may happen. And it will cause the kernel to crash.
      Therefore, we should check the return value and handle the error.
      Signed-off-by: NGen Zhang <blackgod016574@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      c6a8ac6f
    • G
      efi/x86/Add missing error handling to old_memmap 1:1 mapping code · b47d082f
      Gen Zhang 提交于
      commit 4e78921ba4dd0aca1cc89168f45039add4183f8e upstream.
      
      The old_memmap flow in efi_call_phys_prolog() performs numerous memory
      allocations, and either does not check for failure at all, or it does
      but fails to propagate it back to the caller, which may end up calling
      into the firmware with an incomplete 1:1 mapping.
      
      So let's fix this by returning NULL from efi_call_phys_prolog() on
      memory allocation failures only, and by handling this condition in the
      caller. Also, clean up any half baked sets of page tables that we may
      have created before returning with a NULL return value.
      
      Note that any failure at this level will trigger a panic() two levels
      up, so none of this makes a huge difference, but it is a nice cleanup
      nonetheless.
      
      [ardb: update commit log, add efi_call_phys_epilog() call on error path]
      Signed-off-by: NGen Zhang <blackgod016574@gmail.com>
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rob Bradford <robert.bradford@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/20190525112559.7917-2-ard.biesheuvel@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      b47d082f
    • G
      ipv6_sockglue: Fix a missing-check bug in ip6_ra_control() · 1f3d34ec
      Gen Zhang 提交于
      commit 95baa60a0da80a0143e3ddd4d3725758b4513825 upstream.
      
      In function ip6_ra_control(), the pointer new_ra is allocated a memory
      space via kmalloc(). And it is used in the following codes. However,
      when there is a memory allocation error, kmalloc() fails. Thus null
      pointer dereference may happen. And it will cause the kernel to crash.
      Therefore, we should check the return value and handle the error.
      Signed-off-by: NGen Zhang <blackgod016574@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      1f3d34ec
    • G
      scsi: mpt3sas_ctl: fix double-fetch bug in _ctl_ioctl_main() · b9947abb
      Gen Zhang 提交于
      commit f9e3ebeea4521652318af903cddeaf033527e93e upstream.
      
      In _ctl_ioctl_main(), 'ioctl_header' is fetched the first time from
      userspace. 'ioctl_header.ioc_number' is then checked. The legal result is
      saved to 'ioc'. Then, in condition MPT3COMMAND, the whole struct is fetched
      again from the userspace. Then _ctl_do_mpt_command() is called, 'ioc' and
      'karg' as inputs.
      
      However, a malicious user can change the 'ioc_number' between the two
      fetches, which will cause a potential security issues.  Moreover, a
      malicious user can provide a valid 'ioc_number' to pass the check in first
      fetch, and then modify it in the second fetch.
      
      To fix this, we need to recheck the 'ioc_number' in the second fetch.
      Signed-off-by: NGen Zhang <blackgod016574@gmail.com>
      Acked-by: NSuganath Prabu S <suganath-prabu.subramani@broadcom.com>
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      b9947abb
    • G
      clk-sunxi: fix a missing-check bug in sunxi_divs_clk_setup() · 6731e4fa
      Gen Zhang 提交于
      commit fcdf445ff42f036d22178b49cf64e92d527c1330 upstream.
      
      In sunxi_divs_clk_setup(), 'derived_name' is allocated by kstrndup().
      It returns NULL when fails. 'derived_name' should be checked.
      Signed-off-by: NGen Zhang <blackgod016574@gmail.com>
      Signed-off-by: NMaxime Ripard <maxime.ripard@bootlin.com>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      6731e4fa
    • G
      powerpc/pseries/dlpar: Fix a missing check in dlpar_parse_cc_property() · 8625d444
      Gen Zhang 提交于
      commit efa9ace68e487ddd29c2b4d6dd23242158f1f607 upstream.
      
      In dlpar_parse_cc_property(), 'prop->name' is allocated by kstrdup().
      kstrdup() may return NULL, so it should be checked and handle error.
      And prop should be freed if 'prop->name' is NULL.
      Signed-off-by: NGen Zhang <blackgod016574@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      8625d444
    • M
    • A
      reduce e1000e boot time by tightening sleep ranges · 14898532
      Arjan van de Ven 提交于
      commit ab6973aed6200510662856afce5e3d1e386b7b64 upstream.
      
      The e1000e driver is a great user of the usleep_range() API,
      and has any nice ranges that in principle help power management.
      
      However the ranges that are used only during system startup are
      very long (and can add easily 100 msec to the boot time) while
      the power savings of such long ranges is irrelevant due to the
      one-off, boot only, nature of these functions.
      
      This patch shrinks some of the longest ranges to be shorter
      (while still using a power friendly 1 msec range); this saves
      100msec+ of boot time on my BDW NUCs
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NPaul Menzel <pmenzel@molgen.mpg.de>
      Tested-by: NAaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      14898532
    • J
      psi: get poll_work to run when calling poll syscall next time · 0b57115b
      Jason Xing 提交于
      Only when calling the poll syscall the first time can user
      receive POLLPRI correctly. After that, user always fails to
      acquire the event signal.
      
      Reproduce case:
      1. Get the monitor code in Documentation/accounting/psi.txt
      2. Run it, and wait for the event triggered.
      3. Kill and restart the process.
      
      The question is why we can end up with poll_scheduled = 1 but the work
      not running (which would reset it to 0). And the answer is because the
      scheduling side sees group->poll_kworker under RCU protection and then
      schedules it, but here we cancel the work and destroy the worker. The
      cancel needs to pair with resetting the poll_scheduled flag.
      Signed-off-by: NJason Xing <kerneljasonxing@linux.alibaba.com>
      Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>
      Reviewed-by: NSuren Baghdasaryan <surenb@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      0b57115b
    • E
      ext4: fix bigalloc cluster freeing when hole punching under load · 8fac7eda
      Eric Whitney 提交于
      commit 7bd75230b43727b258a4f7a59d62114cffe1b6c8 upstream.
      
      Ext4 may not free clusters correctly when punching holes in bigalloc
      file systems under high load conditions.  If it's not possible to
      extend and restart the journal in ext4_ext_rm_leaf() when preparing to
      remove blocks from a punched region, a retry of the entire punch
      operation is triggered in ext4_ext_remove_space().  This causes a
      partial cluster to be set to the first cluster in the extent found to
      the right of the punched region.  However, if the punch operation
      prior to the retry had made enough progress to delete one or more
      extents and a partial cluster candidate for freeing had already been
      recorded, the retry would overwrite the partial cluster.  The loss of
      this information makes it impossible to correctly free the original
      partial cluster in all cases.
      
      This bug can cause generic/476 to fail when run as part of
      xfstests-bld's bigalloc and bigalloc_1k test cases.  The failure is
      reported when e2fsck detects bad iblocks counts greater than expected
      in units of whole clusters and also detects a number of negative block
      bitmap differences equal to the iblocks discrepancy in cluster units.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      8fac7eda
    • G
      ext4: fix build error when DX_DEBUG is defined · f443929b
      Gabriel Krisman Bertazi 提交于
      commit 799578ab16e86b074c184ec5abbda0bc698c7b0b upstream.
      
      Enabling DX_DEBUG triggers the build error below.  info is an attribute
      of  the dxroot structure.
      
      linux/fs/ext4/namei.c:2264:12: error: ‘info’
      undeclared (first use in this function); did you mean ‘insl’?
      	   	  info->indirect_levels));
      
      Fixes: e08ac99f ("ext4: add largedir feature")
      Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.co.uk>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      f443929b
    • D
      mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock · a432d982
      Dave Chinner 提交于
      commit 64081362e8ff4587b4554087f3cfc73d3e0a4cd7 upstream.
      
      We've recently seen a workload on XFS filesystems with a repeatable
      deadlock between background writeback and a multi-process application
      doing concurrent writes and fsyncs to a small range of a file.
      
      range_cyclic
      writeback		Process 1		Process 2
      
      xfs_vm_writepages
        write_cache_pages
          writeback_index = 2
          cycled = 0
          ....
          find page 2 dirty
          lock Page 2
          ->writepage
            page 2 writeback
            page 2 clean
            page 2 added to bio
          no more pages
      			write()
      			locks page 1
      			dirties page 1
      			locks page 2
      			dirties page 1
      			fsync()
      			....
      			xfs_vm_writepages
      			write_cache_pages
      			  start index 0
      			  find page 1 towrite
      			  lock Page 1
      			  ->writepage
      			    page 1 writeback
      			    page 1 clean
      			    page 1 added to bio
      			  find page 2 towrite
      			  lock Page 2
      			  page 2 is writeback
      			  <blocks>
      						write()
      						locks page 1
      						dirties page 1
      						fsync()
      						....
      						xfs_vm_writepages
      						write_cache_pages
      						  start index 0
      
          !done && !cycled
            sets index to 0, restarts lookup
          find page 1 dirty
      						  find page 1 towrite
      						  lock Page 1
      						  page 1 is writeback
      						  <blocks>
      
          lock Page 1
          <blocks>
      
      DEADLOCK because:
      
      	- process 1 needs page 2 writeback to complete to make
      	  enough progress to issue IO pending for page 1
      	- writeback needs page 1 writeback to complete so process 2
      	  can progress and unlock the page it is blocked on, then it
      	  can issue the IO pending for page 2
      	- process 2 can't make progress until process 1 issues IO
      	  for page 1
      
      The underlying cause of the problem here is that range_cyclic writeback is
      processing pages in descending index order as we hold higher index pages
      in a structure controlled from above write_cache_pages().  The
      write_cache_pages() caller needs to be able to submit these pages for IO
      before write_cache_pages restarts writeback at mapping index 0 to avoid
      wcp inverting the page lock/writeback wait order.
      
      generic_writepages() is not susceptible to this bug as it has no private
      context held across write_cache_pages() - filesystems using this
      infrastructure always submit pages in ->writepage immediately and so there
      is no problem with range_cyclic going back to mapping index 0.
      
      However:
      	mpage_writepages() has a private bio context,
      	exofs_writepages() has page_collect
      	fuse_writepages() has fuse_fill_wb_data
      	nfs_writepages() has nfs_pageio_descriptor
      	xfs_vm_writepages() has xfs_writepage_ctx
      
      All of these ->writepages implementations can hold pages under writeback
      in their private structures until write_cache_pages() returns, and hence
      they are all susceptible to this deadlock.
      
      Also worth noting is that ext4 has it's own bastardised version of
      write_cache_pages() and so it /may/ have an equivalent deadlock.  I looked
      at the code long enough to understand that it has a similar retry loop for
      range_cyclic writeback reaching the end of the file and then promptly ran
      away before my eyes bled too much.  I'll leave it for the ext4 developers
      to determine if their code is actually has this deadlock and how to fix it
      if it has.
      
      There's a few ways I can see avoid this deadlock.  There's probably more,
      but these are the first I've though of:
      
      1. get rid of range_cyclic altogether
      
      2. range_cyclic always stops at EOF, and we start again from
      writeback index 0 on the next call into write_cache_pages()
      
      2a. wcp also returns EAGAIN to ->writepages implementations to
      indicate range cyclic has hit EOF. writepages implementations can
      then flush the current context and call wpc again to continue. i.e.
      lift the retry into the ->writepages implementation
      
      3. range_cyclic uses trylock_page() rather than lock_page(), and it
      skips pages it can't lock without blocking. It will already do this
      for pages under writeback, so this seems like a no-brainer
      
      3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid
      blocking as per pages under writeback.
      
      I don't think #1 is an option - range_cyclic prevents frequently
      dirtied lower file offset from starving background writeback of
      rarely touched higher file offsets.
      
      performance as going back to the start of the file implies an
      immediate seek. We'll have exactly the same number of seeks if we
      switch writeback to another inode, and then come back to this one
      later and restart from index 0.
      
      retry loop up into the wcp caller means we can issue IO on the
      pending pages before calling wcp again, and so avoid locking or
      waiting on pages in the wrong order. I'm not convinced we need to do
      this given that we get the same thing from #2 on the next writeback
      call from the writeback infrastructure.
      
      inversion problem, just prevents it from becoming a deadlock
      situation. I'd prefer we fix the inversion, not sweep it under the
      carpet like this.
      
      band-aid fix of #3.
      
      So it seems that the simplest way to fix this issue is to implement
      solution #2
      
      Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.comSigned-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.de>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      a432d982
    • M
      block: fix single range discard merge · d6b88955
      Ming Lei 提交于
      commit 2a5cf35cd6c56b2924bce103413ad3381bdc31fa upstream.
      
      There are actually two kinds of discard merge:
      
      - one is the normal discard merge, just like normal read/write request,
      and call it single-range discard
      
      - another is the multi-range discard, queue_max_discard_segments(rq->q) > 1
      
      For the former case, queue_max_discard_segments(rq->q) is 1, and we
      should handle this kind of discard merge like the normal read/write
      request.
      
      This patch fixes the following kernel panic issue[1], which is caused by
      not removing the single-range discard request from elevator queue.
      
      Guangwu has one raid discard test case, in which this issue is a bit
      easier to trigger, and I verified that this patch can fix the kernel
      panic issue in Guangwu's test case.
      
      [1] kernel panic log from Jens's report
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000148
       PGD 0 P4D 0.
       Oops: 0000 [#1] SMP PTI
       CPU: 37 PID: 763 Comm: kworker/37:1H Not tainted \
      4.20.0-rc3-00649-ge64d9a554a91-dirty #14  Hardware name: Wiwynn \
      Leopard-Orv2/Leopard-DDR BW, BIOS LBM08   03/03/2017       Workqueue: kblockd \
      blk_mq_run_work_fn                                            RIP: \
      0010:blk_mq_get_driver_tag+0x81/0x120                                       Code: 24 \
      10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 \
      0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 \
      f6 87 b0 00 00 00 02  RSP: 0018:ffffc90004aabd30 EFLAGS: 00010246                     \
        RAX: 0000000000000003 RBX: ffff888465ea1300 RCX: ffffc90004aabde8
       RDX: 00000000ffffffff RSI: ffffc90004aabde8 RDI: 0000000000000000
       RBP: 0000000000000000 R08: ffff888465ea1348 R09: 0000000000000000
       R10: 0000000000001000 R11: 00000000ffffffff R12: ffff888465ea1300
       R13: 0000000000000000 R14: ffff888465ea1348 R15: ffff888465d10000
       FS:  0000000000000000(0000) GS:ffff88846f9c0000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000148 CR3: 000000000220a003 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        blk_mq_dispatch_rq_list+0xec/0x480
        ? elv_rb_del+0x11/0x30
        blk_mq_do_dispatch_sched+0x6e/0xf0
        blk_mq_sched_dispatch_requests+0xfa/0x170
        __blk_mq_run_hw_queue+0x5f/0xe0
        process_one_work+0x154/0x350
        worker_thread+0x46/0x3c0
        kthread+0xf5/0x130
        ? process_one_work+0x350/0x350
        ? kthread_destroy_worker+0x50/0x50
        ret_from_fork+0x1f/0x30
       Modules linked in: sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel \
      kvm switchtec irqbypass iTCO_wdt iTCO_vendor_support efivars cdc_ether usbnet mii \
      cdc_acm i2c_i801 lpc_ich mfd_core ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq \
      button sch_fq_codel nfsd nfs_acl lockd grace auth_rpcgss oid_registry sunrpc nvme \
      nvme_core fuse sg loop efivarfs autofs4  CR2: 0000000000000148                        \
      
       ---[ end trace 340a1fb996df1b9b ]---
       RIP: 0010:blk_mq_get_driver_tag+0x81/0x120
       Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 \
      00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 \
      20 72 37 f6 87 b0 00 00 00 02
      
      Fixes: 445251d0 ("blk-mq: fix discard merge with scheduler attached")
      Reported-by: NJens Axboe <axboe@kernel.dk>
      Cc: Guangwu Zhang <guazhang@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      d6b88955
    • O
      NFSv4.x: fix lock recovery during delegation recall · 23cb2e27
      Olga Kornievskaia 提交于
      commit 44f411c353bf6d98d5a34f8f1b8605d43b2e50b8 upstream.
      
      Running "./nfstest_delegation --runtest recall26" uncovers that
      client doesn't recover the lock when we have an appending open,
      where the initial open got a write delegation.
      
      Instead of checking for the passed in open context against
      the file lock's open context. Check that the state is the same.
      Signed-off-by: NOlga Kornievskaia <kolga@netapp.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      23cb2e27
    • J
      block: fix the DISCARD request merge · 4ff7fa23
      Jianchao Wang 提交于
      commit 69840466086d2248898020a08dda52732686c4e6 upstream.
      
      There are two cases when handle DISCARD merge.
      If max_discard_segments == 1, the bios/requests need to be contiguous
      to merge. If max_discard_segments > 1, it takes every bio as a range
      and different range needn't to be contiguous.
      
      But now, attempt_merge screws this up. It always consider contiguity
      for DISCARD for the case max_discard_segments > 1 and cannot merge
      contiguous DISCARD for the case max_discard_segments == 1, because
      rq_attempt_discard_merge always returns false in this case.
      This patch fixes both of the two cases above.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      4ff7fa23
    • H
      dm raid: fix false -EBUSY when handling check/repair message · 8456775a
      Heinz Mauelshagen 提交于
      commit 74694bcbdf7e28a5ad548cdda9ac56d30be00d13 upstream.
      
      Sending a check/repair message infrequently leads to -EBUSY instead of
      properly identifying an active resync.  This occurs because
      raid_message() is testing recovery bits in a racy way.
      
      Fix by calling decipher_sync_action() from raid_message() to properly
      identify the idle state of the RAID device.
      Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      8456775a
    • D
      xfs: fix use-after-free race in xfs_buf_rele · aaef4a2d
      Dave Chinner 提交于
      commit 37fd1678245f7a5898c1b05128bc481fb403c290 upstream.
      
      When looking at a 4.18 based KASAN use after free report, I noticed
      that racing xfs_buf_rele() may race on dropping the last reference
      to the buffer and taking the buffer lock. This was the symptom
      displayed by the KASAN report, but the actual issue that was
      reported had already been fixed in 4.19-rc1 by commit e339dd8d
      ("xfs: use sync buffer I/O for sync delwri queue submission").
      
      Despite this, I think there is still an issue with xfs_buf_rele()
      in this code:
      
              release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
              spin_lock(&bp->b_lock);
              if (!release) {
      .....
      
      If two threads race on the b_lock after both dropping a reference
      and one getting dropping the last reference so release = true, we
      end up with:
      
      CPU 0				CPU 1
      atomic_dec_and_lock()
      				atomic_dec_and_lock()
      				spin_lock(&bp->b_lock)
      spin_lock(&bp->b_lock)
      <spins>
      				<release = true bp->b_lru_ref = 0>
      				<remove from lists>
      				freebuf = true
      				spin_unlock(&bp->b_lock)
      				xfs_buf_free(bp)
      <gets lock, reading and writing freed memory>
      <accesses freed memory>
      spin_unlock(&bp->b_lock) <reads/writes freed memory>
      
      IOWs, we can't safely take bp->b_lock after dropping the hold
      reference because the buffer may go away at any time after we
      drop that reference. However, this can be fixed simply by taking the
      bp->b_lock before we drop the reference.
      
      It is safe to nest the pag_buf_lock inside bp->b_lock as the
      pag_buf_lock is only used to serialise against lookup in
      xfs_buf_find() and no other locks are held over or under the
      pag_buf_lock there. Make this clear by documenting the buffer lock
      orders at the top of the file.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      aaef4a2d
    • W
      x86: uaccess: Inhibit speculation past access_ok() in user_access_begin() · 545a8448
      Will Deacon 提交于
      commit 6e693b3ffecb0b478c7050b44a4842854154f715 upstream.
      
      Commit 594cc251fdd0 ("make 'user_access_begin()' do 'access_ok()'")
      makes the access_ok() check part of the user_access_begin() preceding a
      series of 'unsafe' accesses.  This has the desirable effect of ensuring
      that all 'unsafe' accesses have been range-checked, without having to
      pick through all of the callsites to verify whether the appropriate
      checking has been made.
      
      However, the consolidated range check does not inhibit speculation, so
      it is still up to the caller to ensure that they are not susceptible to
      any speculative side-channel attacks for user addresses that ultimately
      fail the access_ok() check.
      
      This is an oversight, so use __uaccess_begin_nospec() to ensure that
      speculation is inhibited until the access_ok() check has passed.
      Reported-by: NJulien Thierry <julien.thierry@arm.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      545a8448
    • L
      make 'user_access_begin()' do 'access_ok()' · 747f8a41
      Linus Torvalds 提交于
      commit 594cc251fdd0d231d342d88b2fdff4bc42fb0690 upstream.
      
      Originally, the rule used to be that you'd have to do access_ok()
      separately, and then user_access_begin() before actually doing the
      direct (optimized) user access.
      
      But experience has shown that people then decide not to do access_ok()
      at all, and instead rely on it being implied by other operations or
      similar.  Which makes it very hard to verify that the access has
      actually been range-checked.
      
      If you use the unsafe direct user accesses, hardware features (either
      SMAP - Supervisor Mode Access Protection - on x86, or PAN - Privileged
      Access Never - on ARM) do force you to use user_access_begin().  But
      nothing really forces the range check.
      
      By putting the range check into user_access_begin(), we actually force
      people to do the right thing (tm), and the range check vill be visible
      near the actual accesses.  We have way too long a history of people
      trying to avoid them.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      
      [ Shile: fix following conflicts by adding a dummy arguments ]
      Conflicts:
      	kernel/compat.c
      	kernel/exit.c
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      747f8a41
    • L
      i915: fix missing user_access_end() in page fault exception case · a5de48e5
      Linus Torvalds 提交于
      commit 0b2c8f8b6b0c7530e2866c95862546d0da2057b0 upstream.
      
      When commit fddcd00a49e9 ("drm/i915: Force the slow path after a
      user-write error") unified the error handling for various user access
      problems, it didn't do the user_access_end() that is needed for the
      unsafe_put_user() case.
      
      It's not a huge deal: a missed user_access_end() will only mean that
      SMAP protection isn't active afterwards, and for the error case we'll be
      returning to user mode soon enough anyway.  But it's wrong, and adding
      the proper user_access_end() is trivial enough (and doing it for the
      other error cases where it isn't needed doesn't hurt).
      
      I noticed it while doing the same prep-work for changing
      user_access_begin() that precipitated the access_ok() changes in commit
      96d4f267e40f ("Remove 'type' argument from access_ok() function").
      
      Fixes: fddcd00a49e9 ("drm/i915: Force the slow path after a user-write error")
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: stable@kernel.org # v4.20
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      a5de48e5
    • C
      drm/i915: Force the slow path after a user-write error · a11346be
      Chris Wilson 提交于
      commit fddcd00a49e9122a3579247151e9cb3ce5a1a36e upstream.
      
      If we fail to write the user relocation back when it is changed, force
      ourselves to take the slow relocation path where we can handle faults in
      the write path. There is still an element of dubiousness as having
      patched up the batch to use the correct offset, it no longer matches the
      presumed_offset in the relocation, so a second pass may miss any changes
      in layout.
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20180903083337.13134-3-chris@chris-wilson.co.ukSigned-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      a11346be
    • A
      userfaultfd: allow get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) to trigger userfaults · 6fc1da0b
      Andrea Arcangeli 提交于
      commit 3b9aadf7278d16d7bed4d5d808501065f70898d8 upstream.
      
      get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) called a get_user_pages that would
      not be waiting for userfaults before failing and it would hit on a SIGBUS
      instead.  Using get_user_pages_locked/unlocked instead will allow
      get_mempolicy to allow userfaults to resolve the fault and fill the hole,
      before grabbing the node id of the page.
      
      If the user calls get_mempolicy() with MPOL_F_ADDR | MPOL_F_NODE for an
      address inside an area managed by uffd and there is no page at that
      address, the page allocation from within get_mempolicy() will fail
      because get_user_pages() does not allow for page fault retry required
      for uffd; the user will get SIGBUS.
      
      With this patch, the page fault will be resolved by the uffd and the
      get_mempolicy() will continue normally.
      
      Background:
      
      Via code review, previously the syscall would have returned -EFAULT
      (vm_fault_to_errno), now it will block and wait for an userfault (if
      it's waken before the fault is resolved it'll still -EFAULT).
      
      This way get_mempolicy will give a chance to an "unaware" app to be
      compliant with userfaults.
      
      The reason this visible change is that becoming "userfault compliant"
      cannot regress anything: all other syscalls including read(2)/write(2)
      had to become "userfault compliant" long time ago (that's one of the
      things userfaultfd can do that PROT_NONE and trapping segfaults can't).
      
      So this is just one more syscall that become "userfault compliant" like
      all other major ones already were.
      
      This has been happening on virtio-bridge dpdk process which just called
      get_mempolicy on the guest space post live migration, but before the
      memory had a chance to be migrated to destination.
      
      I didn't run an strace to be able to show the -EFAULT going away, but
      I've the confirmation of the below debug aid information (only visible
      with CONFIG_DEBUG_VM=y) going away with the patch:
      
          [20116.371461] FAULT_FLAG_ALLOW_RETRY missing 0
          [20116.371464] CPU: 1 PID: 13381 Comm: vhost-events Not tainted 4.17.12-200.fc28.x86_64 #1
          [20116.371465] Hardware name: LENOVO 20FAS2BN0A/20FAS2BN0A, BIOS N1CET54W (1.22 ) 02/10/2017
          [20116.371466] Call Trace:
          [20116.371473]  dump_stack+0x5c/0x80
          [20116.371476]  handle_userfault.cold.37+0x1b/0x22
          [20116.371479]  ? remove_wait_queue+0x20/0x60
          [20116.371481]  ? poll_freewait+0x45/0xa0
          [20116.371483]  ? do_sys_poll+0x31c/0x520
          [20116.371485]  ? radix_tree_lookup_slot+0x1e/0x50
          [20116.371488]  shmem_getpage_gfp+0xce7/0xe50
          [20116.371491]  ? page_add_file_rmap+0x1a/0x2c0
          [20116.371493]  shmem_fault+0x78/0x1e0
          [20116.371495]  ? filemap_map_pages+0x3a1/0x450
          [20116.371498]  __do_fault+0x1f/0xc0
          [20116.371500]  __handle_mm_fault+0xe2e/0x12f0
          [20116.371502]  handle_mm_fault+0xda/0x200
          [20116.371504]  __get_user_pages+0x238/0x790
          [20116.371506]  get_user_pages+0x3e/0x50
          [20116.371510]  kernel_get_mempolicy+0x40b/0x700
          [20116.371512]  ? vfs_write+0x170/0x1a0
          [20116.371515]  __x64_sys_get_mempolicy+0x21/0x30
          [20116.371517]  do_syscall_64+0x5b/0x160
          [20116.371520]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The above harmless debug message (not a kernel crash, just a
      dump_stack()) is shown with CONFIG_DEBUG_VM=y to more quickly identify
      and improve kernel spots that may have to become "userfaultfd
      compliant" like this one (without having to run an strace and search
      for syscall misbehavior).  Spots like the above are more closer to a
      kernel bug for the non-cooperative usages that Mike focuses on, than
      for for dpdk qemu-cooperative usages that reproduced it, but it's still
      nicer to get this fixed for dpdk too.
      
      The part of the patch that caused me to think is only the
      implementation issue of mpol_get, but it looks like it should work safe
      no matter the kind of mempolicy structure that is (the default static
      policy also starts at 1 so it'll go to 2 and back to 1 without crashing
      everything at 0).
      
      [rppt@linux.vnet.ibm.com: changelog addition]
        http://lkml.kernel.org/r/20180904073718.GA26916@rapoport-lnx
      Link: http://lkml.kernel.org/r/20180831214848.23676-1-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NMaxime Coquelin <maxime.coquelin@redhat.com>
      Tested-by: NDr. David Alan Gilbert <dgilbert@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      6fc1da0b
    • X
      random: speed up the initialization of module · 94eb2632
      Xingjun Liu 提交于
      During the module initialization phase, entropy will be added
      to entropy pool for every interrupt, the change should speed up
      initialization of the random module.
      
      Before optimization:
      [   22.180236] random: crng init done
      
      After optimization:
      [    1.474832] random: crng init done
      Signed-off-by: NXingjun Liu <xingjun.lxj@alibaba-inc.com>
      Reviewed-by: NLiu Jiang <gerry@linux.alibaba.com>
      Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>
      Reviewed-by: jia zhang's avatarJia Zhang <zhang.jia@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
      94eb2632
    • X
      random: introduce the initialization seed · 0d269934
      Xingjun Liu 提交于
      Add random entropy with the module parameter as the initialization
      seed when the kernel startup.
      
      For guest OS working in VM, the random entropy will be less,
      it cause the random module to initialize very slowly, and if
      the application which running in guest os gets a certain amount of
      random numbers in the initialization phase, it will be blocked.
      
      This patch allows the VMM to provide a certain amount of random seed
      when starting guest OS, speeding up the initialization of the entire
      guest OS random module.
      
      Before optimization:
      [   22.180236] random: crng init done
      
      After optimization:
      [    1.553362] random: crng init done
      Signed-off-by: NXingjun Liu <xingjun.lxj@alibaba-inc.com>
      Reviewed-by: NLiu Jiang <gerry@linux.alibaba.com>
      Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>
      Reviewed-by: jia zhang's avatarJia Zhang <zhang.jia@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
      0d269934
    • B
      cpufreq/intel_pstate: Load only on Intel hardware · d9a55d15
      Borislav Petkov 提交于
      commit 4ab526468344c11d2d1807ae95feb1f5305dc014 upstream.
      
      This driver is Intel-only so loading on anything which is not Intel is
      pointless. Prevent it from doing so.
      
      While at it, correct the "not supported" print statement to say CPU
      "model" which is what that test does.
      
      Fixes: 076b862c7e44 (cpufreq: intel_pstate: Add reasons for failure and debug messages)
      Suggested-by: NErwan Velu <e.velu@criteo.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Reviewed-by: NThomas Renninger <trenn@suse.de>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NShanpei Chen <shanpeic@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      d9a55d15
    • E
      cpufreq: intel_pstate: Add reasons for failure and debug messages · 0f006801
      Erwan Velu 提交于
      commit 076b862c7e4409d2dcacfda19f7eaf8d07ab9200 upstream.
      
      The init code path has several exceptions where the driver can
      decide not to load.
      
      As CONFIG_X86_INTEL_PSTATE is generally set to Y, the return code is
      not reachable.  The initialization code is neither verbose of the
      reason why it did choose to prematurely exit, so it is difficult for
      a user to determine, on a given platform, why the driver didn't load
      properly.
      
      This patch is about reporting to the user the reason/context of why
      the driver failed to load.  That is a precious hint when debugging
      a platform.
      Signed-off-by: NErwan Velu <e.velu@criteo.com>
      [ rjw: Subject & changelog, minor fixups ]
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NShanpei Chen <shanpeic@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      0f006801
    • S
      cpufreq: intel_pstate: Force HWP min perf before offline · 7e3e9112
      Srinivas Pandruvada 提交于
      commit af3b7379e2d709f2d7c6966b8a6f5ec6bd134241 upstream.
      
      Force HWP Request MAX = HWP Request MIN = HWP Capability MIN and EPP to
      0xFF. In this way the performance limits on the offlined CPU will not
      influence performance limits on its sibling CPU, which is still online.
      
      If the sibling CPU is calling for higher performance, it will impact the
      max core performance. Here core performance will follow higher of the
      performance requests from each sibling.
      Reported-and-tested-by: NChen Yu <yu.c.chen@intel.com>
      Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NShanpei Chen <shanpeic@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      7e3e9112
    • M
      dm: add missing trace_block_split() to __split_and_process_bio() · 06596a83
      Mike Snitzer 提交于
      commit 075c18c3e124a1511ebc10a89f1858c8a77dcb01 upstream.
      
      Provides useful context about bio splits in blktrace.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      06596a83
    • M
      dm: fix dm_wq_work() to only use __split_and_process_bio() if appropriate · 96dab465
      Mike Snitzer 提交于
      commit 6548c7c538e5658cbce686c2dd1a9b4f5398bf34 upstream.
      
      Otherwise targets that don't support/expect IO splitting could resubmit
      bios using code paths with unnecessary IO splitting complexity.
      
      Depends-on: 24113d487843 ("dm: avoid indirect call in __dm_make_request")
      Fixes: 978e51ba ("dm: optimize bio-based NVMe IO submission")
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      96dab465
    • M
      dm: avoid indirect call in __dm_make_request · 8632e566
      Mikulas Patocka 提交于
      commit 24113d4878439baf1f23c1a33dfcc340fba66e97 upstream.
      
      Indirect calls are inefficient because of retpolines that are used for
      spectre workaround. This patch replaces an indirect call with a condition
      (that can be predicted by the branch predictor).
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      8632e566
    • M
      dm: fix redundant IO accounting for bios that need splitting · 332d1828
      Mike Snitzer 提交于
      commit a1e1cb72d96491277ede8d257ce6b48a381dd336 upstream.
      
      [Joseph: cherry-pick part_stat_get() from commit 1226b8dd0e91 ("block:
      switch to per-cpu in-flight counters") since we don't want the whole
      patch series get involved.]
      
      The risk of redundant IO accounting was not taken into consideration
      when commit 18a25da8 ("dm: ensure bio submission follows a
      depth-first tree walk") introduced IO splitting in terms of recursion
      via generic_make_request().
      
      Fix this by subtracting the split bio's payload from the IO stats that
      were already accounted for by start_io_acct() upon dm_make_request()
      entry.  This repeat oscillation of the IO accounting, up then down,
      isn't ideal but refactoring DM core's IO splitting to pre-split bios
      _before_ they are accounted turned out to be an excessive amount of
      change that will need a full development cycle to refine and verify.
      
      Before this fix:
      
        /dev/mapper/stripe_dev is a 4-way stripe using a 32k chunksize, so
        bios are split on 32k boundaries.
      
        # fio --name=16M --filename=/dev/mapper/stripe_dev --rw=write --bs=64k --size=16M \
          	--iodepth=1 --ioengine=libaio --direct=1 --refill_buffers
      
        with debugging added:
        [103898.310264] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=0 len=128
        [103898.318704] device-mapper: core: __split_and_process_bio: recursing for following split bio:
        [103898.329136] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=64 len=64
        ...
      
        16M written yet 136M (278528 * 512b) accounted:
        # cat /sys/block/dm-2/stat | awk '{ print $7 }'
        278528
      
      After this fix:
      
        16M written and 16M (32768 * 512b) accounted:
        # cat /sys/block/dm-2/stat | awk '{ print $7 }'
        32768
      
      Fixes: 18a25da8 ("dm: ensure bio submission follows a depth-first tree walk")
      Cc: stable@vger.kernel.org # 4.16+
      Reported-by: NBryan Gurney <bgurney@redhat.com>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      332d1828
    • M
      dm: fix clone_bio() to trigger blk_recount_segments() · 499c3c32
      Mike Snitzer 提交于
      commit 57c36519e4b949f89381053f7283f5d605595b42 upstream.
      
      DM's clone_bio() now benefits from using bio_trim() by fixing the fact
      that clone_bio() wasn't clearing BIO_SEG_VALID like bio_trim() does;
      which triggers blk_recount_segments() via bio_phys_segments().
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      499c3c32
    • X
      ext4: unlock unused_pages timely when doing writeback · de769041
      Xiaoguang Wang 提交于
      commit a297b2fcee461e40df763e179cbbfba5a9e572d2 upstream.
      
      In mpage_add_bh_to_extent(), when accumulated extents length is greater
      than MAX_WRITEPAGES_EXTENT_LEN or buffer head's b_stat is not equal, we
      will not continue to search unmapped area for this page, but note this
      page is locked, and will only be unlocked in mpage_release_unused_pages()
      after ext4_io_submit, if io also is throttled by blk-throttle or similar
      io qos, we will hold this page locked for a while, it's unnecessary.
      
      I think the best fix is to refactor mpage_add_bh_to_extent() to let it
      return some hints whether to unlock this page, but given that we will
      improve dioread_nolock later, we can let it done later, so currently
      the simple fix would just call mpage_release_unused_pages() before
      ext4_io_submit().
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
      de769041
    • S
      sched: disable auto group by default · b5676749
      Shanpei Chen 提交于
      Autogroup feature is used to improve interactivity for desktop
      application. Since our kernel runs on server, just like RHEL8,
      disable it by default to avoid unnecessary computing.
      
      More details, please refer https://lwn.net/Articles/416641/Signed-off-by: NShanpei Chen <shanpeic@linux.alibaba.com>
      Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>
      b5676749
    • D
      kernel/sched/psi.c: expose pressure metrics on root cgroup · 16925d58
      Dan Schatzberg 提交于
      commit df5ba5be7425e1df296d40c5f37a39d98ec666a2 upstream.
      
      Pressure metrics are already recorded and exposed in procfs for the
      entire system, but any tool which monitors cgroup pressure has to
      special case the root cgroup to read from procfs.  This patch exposes
      the already recorded pressure metrics on the root cgroup.
      
      Link: http://lkml.kernel.org/r/20190510174938.3361741-1-dschatzberg@fb.comSigned-off-by: NDan Schatzberg <dschatzberg@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      16925d58