1. 02 Sep, 2020 40 commits
    • io_uring: Fix NULL pointer dereference in loop_rw_iter() · 58511eb2
      Committed by Guoyu Huang
      fix #29760246
      
      Cherry-pick 2dd2111d0d383df104b144e0d1f6b5a00cb7cd88 from io_uring-5.9.
      
      loop_rw_iter() does not check whether the file has a read or
      write function. This can lead to a NULL pointer dereference
      when the user passes in a file descriptor that does not have
      a read or write function.
      
      The crash log looks like this:
      
      [   99.834071] BUG: kernel NULL pointer dereference, address: 0000000000000000
      [   99.835364] #PF: supervisor instruction fetch in kernel mode
      [   99.836522] #PF: error_code(0x0010) - not-present page
      [   99.837771] PGD 8000000079d62067 P4D 8000000079d62067 PUD 79d8c067 PMD 0
      [   99.839649] Oops: 0010 [#2] SMP PTI
      [   99.840591] CPU: 1 PID: 333 Comm: io_wqe_worker-0 Tainted: G      D           5.8.0 #2
      [   99.842622] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
      [   99.845140] RIP: 0010:0x0
      [   99.845840] Code: Bad RIP value.
      [   99.846672] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
      [   99.848018] RAX: 0000000000000000 RBX: ffff92363bd67300 RCX: ffff92363d461208
      [   99.849854] RDX: 0000000000000010 RSI: 00007ffdbf696bb0 RDI: ffff92363bd67300
      [   99.851743] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
      [   99.853394] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
      [   99.855148] R13: 0000000000000000 R14: ffff92363d461208 R15: ffffa1c7c01ebc68
      [   99.856914] FS:  0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
      [   99.858651] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   99.860032] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0
      [   99.861979] Call Trace:
      [   99.862617]  loop_rw_iter.part.0+0xad/0x110
      [   99.863838]  io_write+0x2ae/0x380
      [   99.864644]  ? kvm_sched_clock_read+0x11/0x20
      [   99.865595]  ? sched_clock+0x9/0x10
      [   99.866453]  ? sched_clock_cpu+0x11/0xb0
      [   99.867326]  ? newidle_balance+0x1d4/0x3c0
      [   99.868283]  io_issue_sqe+0xd8f/0x1340
      [   99.869216]  ? __switch_to+0x7f/0x450
      [   99.870280]  ? __switch_to_asm+0x42/0x70
      [   99.871254]  ? __switch_to_asm+0x36/0x70
      [   99.872133]  ? lock_timer_base+0x72/0xa0
      [   99.873155]  ? switch_mm_irqs_off+0x1bf/0x420
      [   99.874152]  io_wq_submit_work+0x64/0x180
      [   99.875192]  ? kthread_use_mm+0x71/0x100
      [   99.876132]  io_worker_handle_work+0x267/0x440
      [   99.877233]  io_wqe_worker+0x297/0x350
      [   99.878145]  kthread+0x112/0x150
      [   99.878849]  ? __io_worker_unuse+0x100/0x100
      [   99.879935]  ? kthread_park+0x90/0x90
      [   99.880874]  ret_from_fork+0x22/0x30
      [   99.881679] Modules linked in:
      [   99.882493] CR2: 0000000000000000
      [   99.883324] ---[ end trace 4453745f4673190b ]---
      [   99.884289] RIP: 0010:0x0
      [   99.884837] Code: Bad RIP value.
      [   99.885492] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
      [   99.886851] RAX: 0000000000000000 RBX: ffff92363acd7f00 RCX: ffff92363d461608
      [   99.888561] RDX: 0000000000000010 RSI: 00007ffe040d9e10 RDI: ffff92363acd7f00
      [   99.890203] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
      [   99.891907] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
      [   99.894106] R13: 0000000000000000 R14: ffff92363d461608 R15: ffffa1c7c01ebc68
      [   99.896079] FS:  0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
      [   99.898017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   99.899197] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0
      
      Fixes: 32960613b7c3 ("io_uring: correctly handle non ->{read,write}_iter() file_operations")
      Cc: stable@vger.kernel.org
      Signed-off-by: Guoyu Huang <hgy5945@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
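The missing check described in the loop_rw_iter() commit can be illustrated outside the kernel. The following is a minimal user-space sketch, not the kernel patch: `struct fake_file_ops`, `accept_all()`, `loop_rw()`, and the choice of `-EINVAL` as the error code are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for the kernel's file_operations table. */
struct fake_file_ops {
    ssize_t (*read)(char *buf, size_t len);
    ssize_t (*write)(const char *buf, size_t len);
};

/* Example write handler: pretends every byte was written. */
static ssize_t accept_all(const char *buf, size_t len)
{
    (void)buf;
    return (ssize_t)len;
}

/*
 * Dispatch a write through the ops table. Without the NULL check, a table
 * that lacks ->write would make this call jump to address 0, which matches
 * the "RIP: 0010:0x0" seen in the crash log.
 */
static ssize_t loop_rw(const struct fake_file_ops *ops, const char *buf, size_t len)
{
    assert(ops != NULL);
    if (ops->write == NULL)
        return -EINVAL; /* assumed error code for this sketch */
    return ops->write(buf, len);
}
```

A caller passing an ops table with no write handler gets an error back instead of a crash.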
    • io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works · cd40824f
      Committed by Xiaoguang Wang
      fix #29605829
      
      commit 23b3628e45924419399da48c2b3a522b05557c91 upstream
      
      In io_sq_thread(), if there is task work to handle, the current code
      skips schedule() and goes on polling the sq again, but forgets to clear
      the IORING_SQ_NEED_WAKEUP flag. Fix this. Also add two helpers to set
      and clear the IORING_SQ_NEED_WAKEUP flag.
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
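The set/clear helper pattern from the IORING_SQ_NEED_WAKEUP commit might look roughly like this in user space. This is a sketch under stated assumptions: `struct sq_ring`, the helper names, and `sq_idle_step()` are hypothetical, and C11 atomics stand in for the kernel's own locking. Only the flag value comes from the io_uring uapi header.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define IORING_SQ_NEED_WAKEUP (1U << 0) /* value from the io_uring uapi header */

/* Hypothetical stand-in for the shared SQ ring flags word. */
struct sq_ring {
    atomic_uint flags;
};

static void ring_set_wakeup_flag(struct sq_ring *r)
{
    atomic_fetch_or(&r->flags, IORING_SQ_NEED_WAKEUP);
}

static void ring_clear_wakeup_flag(struct sq_ring *r)
{
    atomic_fetch_and(&r->flags, ~IORING_SQ_NEED_WAKEUP);
}

/*
 * One idle iteration of a simplified poller: announce the intent to sleep,
 * then, if task work turned up, skip the sleep and clear the flag again.
 * Returns true if the poller would really go to sleep.
 */
static bool sq_idle_step(struct sq_ring *r, bool task_work_pending)
{
    assert(r != NULL);
    ring_set_wakeup_flag(r);
    if (task_work_pending) {
        ring_clear_wakeup_flag(r); /* the step the commit says was forgotten */
        return false;
    }
    return true;
}
```

Keeping both operations behind helpers means every path that skips the sleep can clear the flag with a single call.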
    • io_uring: fix lockup in io_fail_links() · 18ca51d0
      Committed by Pavel Begunkov
      to #29608102
      
      commit 4ae6dbd683860b9edc254ea8acf5e04b5ae242e5 upstream.
      
      io_fail_links() doesn't consider REQ_F_COMP_LOCKED leading to nested
      spin_lock(completion_lock) and lockup.
      
      [  197.680409] rcu: INFO: rcu_preempt detected expedited stalls on
      	CPUs/tasks: { 6-... } 18239 jiffies s: 1421 root: 0x40/.
      [  197.680411] rcu: blocking rcu_node structures:
      [  197.680412] Task dump for CPU 6:
      [  197.680413] link-timeout    R  running task        0  1669
      	1 0x8000008a
      [  197.680414] Call Trace:
      [  197.680420]  ? io_req_find_next+0xa0/0x200
      [  197.680422]  ? io_put_req_find_next+0x2a/0x50
      [  197.680423]  ? io_poll_task_func+0xcf/0x140
      [  197.680425]  ? task_work_run+0x67/0xa0
      [  197.680426]  ? do_exit+0x35d/0xb70
      [  197.680429]  ? syscall_trace_enter+0x187/0x2c0
      [  197.680430]  ? do_group_exit+0x43/0xa0
      [  197.680448]  ? __x64_sys_exit_group+0x18/0x20
      [  197.680450]  ? do_syscall_64+0x52/0xa0
      [  197.680452]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • io_uring: fix ->work corruption with poll_add · 7943eeed
      Committed by Pavel Begunkov
      to #29608102
      
      commit d5e16d8e23825304c6a9945116cc6b6f8d51f28c upstream.
      
      req->work might already be initialised by the time it gets into
      __io_arm_poll_handler(), which will corrupt it by using fields that are
      in a union with req->work. Luckily, the only side effect is a missing
      put_creds(). Clean req->work before going there.
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
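The ->work corruption described in the poll_add commit comes from two sets of fields sharing storage in a union. Here is a small user-space sketch of that effect. All type and function names are hypothetical, and the real io_uring structures are far larger.

```c
#include <assert.h>
#include <stddef.h>

struct work_part { const void *creds; };     /* stands in for req->work */
struct poll_part { const void *wait_head; }; /* stands in for the poll state */

struct fake_req {
    union {
        struct work_part work;
        struct poll_part poll;
    } u;
};

/* Arm polling without cleaning ->work first: the creds pointer is lost. */
static void arm_poll_unsafe(struct fake_req *req, const void *head)
{
    assert(req != NULL);
    req->u.poll.wait_head = head;
}

/*
 * Take the creds pointer out of ->work before the storage is reused.
 * Returns the creds the caller still has to release.
 */
static const void *arm_poll_clean(struct fake_req *req, const void *head)
{
    const void *creds;

    assert(req != NULL);
    creds = req->u.work.creds;
    req->u.poll.wait_head = head;
    return creds;
}
```

In the unsafe variant nothing remembers the credentials once the poll state is written, which corresponds to the missing put_creds() mentioned in the commit message.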
    • io_uring: missed req_init_async() for IOSQE_ASYNC · e6508674
      Committed by Pavel Begunkov
      to #29608102
      
      commit 3e863ea3bb1a2203ae648eb272db0ce6a1a2072c upstream.
      
      The IOSQE_ASYNC branch of io_queue_sqe() is another place where an
      uninitialised req->work can be accessed (i.e. prior to io_req_init_async()).
      Nothing really bad though, it just loses the IO_WQ_WORK_CONCURRENT flag.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • io_uring: always allow drain/link/hardlink/async sqe flags · 6e5cbea1
      Committed by Daniele Albano
      to #29608102
      
      commit 61710e437f2807e26a3402543bdbb7217a9c8620 upstream.
      
      We currently filter these for timeout_remove/async_cancel/files_update,
      but we only should be filtering for fixed file and buffer select. This
      also causes a second read of sqe->flags, which isn't needed.
      
      Just check req->flags for the relevant bits. This then allows these
      commands to be used in links, for example, like everything else.
      Signed-off-by: Daniele Albano <d.albano@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
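The narrower flag filter described in the drain/link/hardlink/async commit can be sketched as a prep check that rejects only the two flags that make no sense for a command without a file or buffer. The bit values below are those of the io_uring uapi header. `struct fake_sqe` and `prep_no_file_cmd()` are hypothetical, and the sketch tests the sqe flags directly, whereas the commit says the kernel checks the corresponding bits in req->flags.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Bit values as defined in the io_uring uapi header. */
#define IOSQE_FIXED_FILE    (1U << 0)
#define IOSQE_IO_DRAIN      (1U << 1)
#define IOSQE_IO_LINK       (1U << 2)
#define IOSQE_IO_HARDLINK   (1U << 3)
#define IOSQE_ASYNC         (1U << 4)
#define IOSQE_BUFFER_SELECT (1U << 5)

/* Hypothetical, reduced submission entry: only the flags byte. */
struct fake_sqe {
    uint8_t flags;
};

/* Prep check for a command that takes neither a file nor a buffer. */
static int prep_no_file_cmd(const struct fake_sqe *sqe)
{
    assert(sqe != NULL);
    if (sqe->flags & (IOSQE_FIXED_FILE | IOSQE_BUFFER_SELECT))
        return -EINVAL;
    return 0; /* drain, link, hardlink and async are all allowed */
}
```

With this filter a timeout_remove-style command can sit in the middle of a link chain, which the broader filter prevented.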
    • io_uring: ensure double poll additions work with both request types · 8d834b0c
      Committed by Jens Axboe
      to #29608102
      
      commit 807abcb0883439af5ead73f3308310453b97b624 upstream.
      
      The double poll additions were centered around doing POLL_ADD on file
      descriptors that use more than one waitqueue (typically one for read,
      one for write) when being polled. However, it can also end up being
      triggered for when we use poll triggered retry. For that case, we cannot
      safely use req->io, as that could be used by the request type itself.
      
      Add a second io_poll_iocb pointer in the structure we allocate for poll
      based retry, and ensure we use the right one from the two paths.
      
      Fixes: 18bceab101ad ("io_uring: allow POLL_ADD with double poll_wait() users")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • ovl: initialize error in ovl_copy_xattr · 7db5692f
      Committed by Yuxuan Shui
      to #28557782
      
      commit 520da69d265a91c6536c63851cbb8a53946974f0 upstream.
      
      In ovl_copy_xattr, if all the xattrs to be copied are overlayfs private
      xattrs, the copy loop will terminate without assigning anything to the
      error variable, thus returning an uninitialized value.
      
      If ovl_copy_xattr is called from ovl_clear_empty, this uninitialized error
      value is put into a pointer by ERR_PTR(), causing potential invalid memory
      accesses down the line.
      
      This commit initializes error with 0. This is the correct value because when
      there's no xattr to copy, because all xattrs are private, ovl_copy_xattr
      should succeed.

      This bug was discovered with the help of INIT_STACK_ALL and clang.
      Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
      Link: https://bugs.chromium.org/p/chromium/issues/detail?id=1050405
      Fixes: 0956254a ("ovl: don't copy up opaqueness")
      Cc: stable@vger.kernel.org # v4.8
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
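The control flow behind the ovl_copy_xattr fix can be shown with a loop that skips some entries and therefore may never assign its result variable. The sketch below is a user-space illustration only: `copy_xattrs()`, `copy_one_fn`, and `copy_ok()` are hypothetical names, and the real function works on kernel xattr buffers rather than a string array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PRIVATE_PREFIX "trusted.overlay." /* overlayfs private xattr prefix */

/* Hypothetical copy hook; returns 0 on success or a negative error. */
typedef int (*copy_one_fn)(const char *name);

static int copy_xattrs(const char *const *names, size_t n, copy_one_fn copy_one)
{
    int error = 0; /* stays 0 when every name is skipped */
    size_t i;

    assert(copy_one != NULL);
    for (i = 0; i < n; i++) {
        if (strncmp(names[i], PRIVATE_PREFIX, strlen(PRIVATE_PREFIX)) == 0)
            continue; /* private xattrs are never copied up */
        error = copy_one(names[i]);
        if (error)
            break;
    }
    return error;
}

/* Example hook that always succeeds. */
static int copy_ok(const char *name)
{
    (void)name;
    return 0;
}
```

If `error` were left uninitialised, a list containing only private names would return whatever happened to be on the stack.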
    • xfs: add agf freeblocks verify in xfs_agf_verify · 028b4911
      Committed by Zheng Bin
      to #28557760
      
      [ Upstream commit d0c7feaf87678371c2c09b3709400be416b2dc62 ]
      
      We recently used a fuzzer (hydra) to test XFS and automatically generated
      tmp.img (XFS v5 format, but with some wrong metadata).
      
      xfs_repair information(just one AG):
      agf_freeblks 0, counted 3224 in ag 0
      agf_longest 536874136, counted 3224 in ag 0
      sb_fdblocks 613, counted 3228
      
      Test as follows:
      mount tmp.img tmpdir
      cp file1M tmpdir
      sync
      
      In 4.19-stable, sync gets stuck. The reason is:
      xfs_mountfs
        xfs_check_summary_counts
          if ((!xfs_sb_version_haslazysbcount(&mp->m_sb) ||
             XFS_LAST_UNMOUNT_WAS_CLEAN(mp)) &&
             !xfs_fs_has_sickness(mp, XFS_SICK_FS_COUNTERS))
      	return 0;  -->just return, incore sb_fdblocks still be 613
          xfs_initialize_perag_data
      
      cp file1M tmpdir -->ok(write file to pagecache)
      sync -->stuck(write pagecache to disk)
      xfs_map_blocks
        xfs_iomap_write_allocate
          while (count_fsb != 0) {
            nimaps = 0;
            while (nimaps == 0) { --> endless loop
               nimaps = 1;
               xfs_bmapi_write(..., &nimaps) --> nimaps becomes 0 again
      xfs_bmapi_write
        xfs_bmap_alloc
          xfs_bmap_btalloc
            xfs_alloc_vextent
              xfs_alloc_fix_freelist
                xfs_alloc_space_available -->fail(agf_freeblks is 0)
      
      In linux-next, sync does not get stuck, because commit c2b3164320b5 ("xfs:
      use the latest extent at writeback delalloc conversion time") removed
      the above while loop. dmesg is as follows:
      [   55.250114] XFS (loop0): page discard on page ffffea0008bc7380, inode 0x1b0c, offset 0.
      
      Users do not know why this page is discarded. The better solutions are:
      1. Like xfs_repair, make sure sb_fdblocks is equal to the counted value
      (xfs_initialize_perag_data does this, but it is not called at this mount)
      2. Add an agf verifier that, on failure, tells users to run repair

      This patch uses the second solution.
      Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
      Signed-off-by: Ren Xudong <renxudong1@huawei.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
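The kind of sanity check the xfs_agf_verify commit adds can be sketched as a plausibility test on the free-space counters. This is an assumption-laden illustration, not the kernel verifier: `struct agf_counts` and `agf_counts_plausible()` are hypothetical, and the fields are taken to be already converted to CPU byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of the AGF counters (already CPU-endian). */
struct agf_counts {
    uint32_t length;   /* blocks in the allocation group */
    uint32_t freeblks; /* total free blocks */
    uint32_t longest;  /* longest free extent */
};

/* Reject counters that cannot describe a real allocation group. */
static bool agf_counts_plausible(const struct agf_counts *agf)
{
    assert(agf != NULL);
    if (agf->freeblks > agf->length)
        return false; /* more free blocks than blocks */
    if (agf->longest > agf->freeblks)
        return false; /* longest extent exceeds the free total */
    return true;
}
```

The fuzzed image from the commit message, with agf_freeblks 0 and agf_longest 536874136, fails the second test, so under such a check the mount would be refused with a hint to repair instead of hanging later in sync.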
    • ext4: fix race between ext4_sync_parent() and rename() · 727bd990
      Committed by Eric Biggers
      to #28557685
      
      commit 08adf452e628b0e2ce9a01048cfbec52353703d7 upstream.
      
      'igrab(d_inode(dentry->d_parent))' without holding dentry->d_lock is
      broken because without d_lock, d_parent can be concurrently changed due
      to a rename().  Then if the old directory is immediately deleted, old
      d_parent->inode can be NULL.  That causes a NULL dereference in igrab().
      
      To fix this, use dget_parent() to safely grab a reference to the parent
      dentry, which pins the inode.  This also eliminates the need to use
      d_find_any_alias() other than for the initial inode, as we no longer
      throw away the dentry at each step.
      
      This is an extremely hard race to hit, but it is possible.  Adding a
      udelay() in between the reads of ->d_parent and its ->d_inode makes it
      reproducible on a no-journal filesystem using the following program:
      
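          #include <fcntl.h>
          #include <stdio.h>      /* rename() */
          #include <sys/stat.h>   /* mkdir() */
          #include <unistd.h>

          int main(void)
          {
              if (fork()) {
                  /* Parent: keep creating dir1/file with synchronous writes. */
                  for (;;) {
                      mkdir("dir1", 0700);
                      /* O_CREAT requires a mode argument. */
                      int fd = open("dir1/file", O_RDWR|O_CREAT|O_SYNC, 0600);
                      write(fd, "X", 1);
                      close(fd);
                  }
              } else {
                  /* Child: move the file away and delete its old parent. */
                  mkdir("dir2", 0700);
                  for (;;) {
                      rename("dir1/file", "dir2/file");
                      rmdir("dir1");
                  }
              }
          }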
      
      Fixes: d59729f4 ("ext4: fix races in ext4_sync_parent()")
      Cc: stable@vger.kernel.org
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Link: https://lore.kernel.org/r/20200506183140.541194-1-ebiggers@kernel.org
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • ext4: fix EXT_MAX_EXTENT/INDEX to check for zeroed eh_max · d9bf1840
      Committed by Harshad Shirwadkar
      to #28557685
      
      commit c36a71b4e35ab35340facdd6964a00956b9fef0a upstream.
      
      If eh->eh_max is 0, EXT_MAX_EXTENT/INDEX would evaluate to unsigned
      (-1) resulting in illegal memory accesses. Although there is no
      consistent repro, we see that generic/019 sometimes crashes because of
      this bug.
      
      Ran gce-xfstests smoke and verified that there were no regressions.
      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Link: https://lore.kernel.org/r/20200421023959.20879-2-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
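The zeroed eh_max problem from the EXT_MAX_EXTENT/INDEX commit is an off-by-one that turns into an out-of-range pointer when the capacity field is 0. A guarded version might look like the following sketch. `struct fake_header`, `struct fake_extent`, and `max_extent()` are hypothetical, and on-disk endianness is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct fake_extent { uint32_t block; };

/* Hypothetical header followed by its entries (endianness ignored). */
struct fake_header {
    uint16_t eh_max;               /* claimed capacity of entries[] */
    struct fake_extent entries[4];
};

/* Last valid slot, or NULL when the header claims no capacity. */
static struct fake_extent *max_extent(struct fake_header *eh)
{
    assert(eh != NULL);
    if (eh->eh_max == 0)
        return NULL; /* unguarded, entries + 0 - 1 points before the array */
    return eh->entries + eh->eh_max - 1;
}
```

Callers that compare a cursor against this value then have to treat NULL as "no room", rather than walking memory in front of the header.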
    • ext4: disable dioread_nolock whenever delayed allocation is disabled · 8c6a9862
      Committed by Eric Whitney
      fix #29455282
      
      commit c8980e1980ccdc2229aa2218d532ddc62e0aabe5 upstream
      
      The patch "ext4: make dioread_nolock the default" (244adf6426ee) causes
      generic/422 to fail when run in kvm-xfstests' ext3conv test case.  This
      applies both the dioread_nolock and nodelalloc mount options, a
      combination not previously tested by kvm-xfstests.  The failure occurs
      because the dioread_nolock code path splits a previously fallocated
      multiblock extent into a series of single block extents when overwriting
      a portion of that extent.  That causes allocation of an extent tree leaf
      node and a reshuffling of extents.  Once writeback is completed, the
      individual extents are recombined into a single extent, the extent is
      moved again, and the leaf node is deleted.  The difference in block
      utilization before and after writeback due to the leaf node triggers the
      failure.
      
      The original reason for this behavior was to avoid ENOSPC when handling
      I/O completions during writeback in the dioread_nolock code paths when
      delayed allocation is disabled.  It may no longer be necessary, because
      code was added in the past to reserve extra space to solve this problem
      when delayed allocation is enabled, and this code may also apply when
      delayed allocation is disabled.  Until this can be verified, don't use
      the dioread_nolock code paths if delayed allocation is disabled.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Link: https://lore.kernel.org/r/20200319150028.24592-1-enwlinux@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • alinux: virtiofs: simplify mount options · d068535c
      Committed by Liu Bo
      task #28910367
      Rather than explicitly specifying "-o
      default_permissions,allow_other", virtiofs can set some default values
      for them.
      
      With this, we can simply do
      "mount -t virtio_fs atest /mnt/test/ -otag=myfs-1,dax".
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • alinux: virtio-fs: export fuse_request_free · e6067150
      Committed by Liu Bo
      task #28910367
      virtio-fs will need to use it from outside fs/fuse/dev.c.
      Make the symbol visible.
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • fuse: Support RENAME_WHITEOUT flag · ebea99bf
      Committed by Vivek Goyal
      task #28910367
      commit 519525fa47b5a8155f0b203e49a3a6a2319f75ae upstream
      
      Allow fuse to pass RENAME_WHITEOUT to fuse server.  Overlayfs on top of
      virtiofs uses RENAME_WHITEOUT.
      
      Without this patch renaming a directory in overlayfs (dir is on lower)
      fails with -EINVAL. With this patch it works.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      (cherry picked from commit 519525fa47b5a8155f0b203e49a3a6a2319f75ae)
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • virtiofs: Use completions while waiting for queue to be drained · 88fa38fa
      Committed by Vivek Goyal
      task #28910367
      commit 724c15a43e2c7ac26e2d07abef99191162498fa9 upstream
      
      While we wait for the queue to finish draining, use completions instead of
      usleep_range(). This is a better way of waiting for an event.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      (cherry picked from commit 724c15a43e2c7ac26e2d07abef99191162498fa9)
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • virtiofs: Do not send forget request "struct list_head" element · 2a6ae53e
      Committed by Vivek Goyal
      task #28910367
      commit 1efcf39eb627573f8d543ea396cf36b0651b1e56 upstream
      
      We are sending the whole virtio_fs_forget struct to the other end over the
      virtqueue. The other end does not need to see elements like "struct list".
      That's an internal detail of the guest kernel. Fix it.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      (cherry picked from commit 1efcf39eb627573f8d543ea396cf36b0651b1e56)
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
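The separation that the forget-request commit describes, between what goes over the virtqueue and what stays in the guest, can be sketched by nesting a wire-only struct inside the bookkeeping struct. The names and field layout below are hypothetical and do not reproduce the real virtio_fs_forget definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire payload: the only bytes the host needs to see. */
struct forget_wire {
    uint64_t nodeid;
    uint64_t nlookup;
};

/* Guest-side bookkeeping wraps the payload with a private list link. */
struct forget_entry {
    struct forget_wire req;
    struct forget_entry *next; /* internal detail, never sent */
};

/* Copy just the payload into the outgoing buffer; returns bytes queued. */
static size_t queue_forget(const struct forget_entry *e, unsigned char *buf, size_t cap)
{
    assert(e != NULL && buf != NULL);
    if (cap < sizeof(e->req))
        return 0; /* not enough room, nothing queued */
    memcpy(buf, &e->req, sizeof(e->req));
    return sizeof(e->req);
}
```

Sizing the transfer from the nested payload type rather than from the outer struct keeps pointers and list links out of the buffer even if more bookkeeping fields are added later.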
    • virtiofs: Use a common function to send forget · a6d9f512
      Committed by Vivek Goyal
      task #28910367
      commit 58ada94f95f71d4f73197ab0e9603dbba6e47fe3 upstream
      
      Currently we are duplicating logic to send forgets at two
      places. Consolidate the code by calling one helper function.
      
      This also uses virtqueue_add_outbuf() instead of
      virtqueue_add_sgs(). The former is simpler to call.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      (cherry picked from commit 58ada94f95f71d4f73197ab0e9603dbba6e47fe3)
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • virtiofs: Fix old-style declaration · 8270fcad
      Committed by YueHaibing
      task #28910367
      commit 00929447f5758c4f64c74d0a4b40a6eb3d9df0e3 upstream
      
      The 'static' keyword is expected to come first in a declaration, and we
      get warnings like this with "make W=1":
      
      fs/fuse/virtio_fs.c:687:1: warning: 'static' is not at beginning of declaration [-Wold-style-declaration]
      fs/fuse/virtio_fs.c:692:1: warning: 'static' is not at beginning of declaration [-Wold-style-declaration]
      fs/fuse/virtio_fs.c:1029:1: warning: 'static' is not at beginning of declaration [-Wold-style-declaration]
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      (cherry picked from commit 00929447f5758c4f64c74d0a4b40a6eb3d9df0e3)
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • virtiofs: Remove set but not used variable 'fc' · 823286b7
      Committed by zhengbin
      task #28910367
      commit 80da5a809d193c60d090cbdf4fe316781bc07965 upstream
      
      Fixes gcc '-Wunused-but-set-variable' warning:
      
      fs/fuse/virtio_fs.c: In function virtio_fs_wake_pending_and_unlock:
      fs/fuse/virtio_fs.c:983:20: warning: variable fc set but not used [-Wunused-but-set-variable]
      
      It is not used since commit 7ee1e2e631db ("virtiofs: No need to check
      fpq->connected state")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
    • virtiofs: Retry request submission from worker context · 986957da
      Committed by Vivek Goyal
      task #28910367
      commit a9bfd9dd3417561d06c81de04f6d6c1e0c9b3d44 upstream
      
      If the regular request queue gets full, we currently sleep for a bit and
      retry submission in the submitter's context. This assumes the submitter is
      not holding any spin lock. But this assumption is not true for background
      requests. For background requests, we are called with fc->bg_lock held.
      
      This can lead to deadlock where one thread is trying submission with
      fc->bg_lock held while request completion thread has called
      fuse_request_end() which tries to acquire fc->bg_lock and gets blocked. As
      request completion thread gets blocked, it does not make further progress
      and that means queue does not get empty and submitter can't submit more
      requests.
      
      To solve this issue, retry submission with the help of a worker, instead of
      retrying in submitter's context. We already do this for hiprio/forget
      requests.
      Reported-by: Chirantan Ekbote <chirantan@chromium.org>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      (cherry picked from commit a9bfd9dd3417561d06c81de04f6d6c1e0c9b3d44)
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • virtiofs: Count pending forgets as in_flight forgets · ab040ec8
      Committed by Vivek Goyal
      task #28910367
      
      commit c17ea009610366146ec409fd6dc277e0f2510b10 upstream
      
      If virtqueue is full, we put forget requests on a list and these forgets
      are dispatched later using a worker. As of now we don't count these forgets
      in fsvq->in_flight variable. This means when queue is being drained, we
      have to have special logic to first drain these pending requests and then
      wait for fsvq->in_flight to go to zero.
      
      By counting pending forgets in fsvq->in_flight, we can get rid of special
      logic and just wait for in_flight to go to zero. Worker thread will kick
      and drain all the forgets anyway, leading in_flight to zero.
      
      I also need similar logic for normal request queue in next patch where I am
      about to defer request submission in the worker context if queue is full.
      
      This simplifies the code a bit.
      
      Also add two helper functions to inc/dec in_flight. The decrement helper
      will later be used to call completion when in_flight reaches zero.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      (cherry picked from commit c17ea009610366146ec409fd6dc277e0f2510b10)
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
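The accounting change in the pending-forgets commit can be modelled with a counter that covers parked requests as well as sent ones, so that draining only has to wait for a single value to reach zero. The sketch below is single-threaded and leaves out the locking. `struct fake_vq` and its helpers are hypothetical, and dispatch and completion are collapsed into one step.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-queue state; locking is omitted in this sketch. */
struct fake_vq {
    unsigned int in_flight; /* sent requests plus parked, unsent ones */
    unsigned int pending;   /* requests waiting for the worker */
};

static void inc_in_flight(struct fake_vq *vq)
{
    vq->in_flight++;
}

static void dec_in_flight(struct fake_vq *vq)
{
    assert(vq->in_flight > 0);
    vq->in_flight--;
}

/* Queue full: park the forget, but count it as in flight right away. */
static void defer_forget(struct fake_vq *vq)
{
    vq->pending++;
    inc_in_flight(vq);
}

/* Worker: take one parked forget and treat it as sent and completed. */
static void worker_dispatch_one(struct fake_vq *vq)
{
    assert(vq->pending > 0);
    vq->pending--;
    dec_in_flight(vq);
}

/* Draining is finished once nothing is counted as in flight. */
static bool queue_drained(const struct fake_vq *vq)
{
    return vq->in_flight == 0;
}
```

Because parked requests are included in the count, the drain path does not need a separate pass over the pending list.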
    • virtiofs: Set FR_SENT flag only after request has been sent · f5fa6847
      Committed by Vivek Goyal
      task #28910367
      commit 5dbe190f341206a7896f7e40c1e3a36933d812f3 upstream
      
      The FR_SENT flag should be set only when the request has been sent
      successfully over the virtqueue. This is used by the interrupt logic to
      figure out whether an interrupt request should be sent or not.

      Also add it to the fpq->processing list after sending it successfully.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • virtiofs: No need to check fpq->connected state · da960fb2
      Committed by Vivek Goyal
      task #28910367
      commit 7ee1e2e631dbf0ff0df2a67a1e01ba3c1dce7a46 upstream
      
      In virtiofs we keep per queue connected state in virtio_fs_vq->connected
      and use that to end request if queue is not connected. And virtiofs does
      not even touch fpq->connected state.
      
      We probably need to merge these two at some point of time. For now,
      simplify the code a bit and do not worry about checking state of
      fpq->connected.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
    • virtiofs: Do not end request in submission context · c3694064
      Committed by Vivek Goyal
      task #28910367
      commit 51fecdd2555b3e0e05a78d30093c638d164a32f9 upstream
      
      Submission context can hold some locks which end request code tries to hold
      again and deadlock can occur. For example, fc->bg_lock. If a background
      request is being submitted, it might hold fc->bg_lock and if we could not
      submit request (because device went away) and tried to end request, then
      deadlock happens. During testing, I also got a warning from deadlock
      detection code.
      
      So put requests on a list and end requests from a worker thread.
      
      I got the following warning from the deadlock detector.
      
      [  603.137138] WARNING: possible recursive locking detected
      [  603.137142] --------------------------------------------
      [  603.137144] blogbench/2036 is trying to acquire lock:
      [  603.137149] 00000000f0f51107 (&(&fc->bg_lock)->rlock){+.+.}, at: fuse_request_end+0xdf/0x1c0 [fuse]
      [  603.140701]
      [  603.140701] but task is already holding lock:
      [  603.140703] 00000000f0f51107 (&(&fc->bg_lock)->rlock){+.+.}, at: fuse_simple_background+0x92/0x1d0 [fuse]
      [  603.140713]
      [  603.140713] other info that might help us debug this:
      [  603.140714]  Possible unsafe locking scenario:
      [  603.140714]
      [  603.140715]        CPU0
      [  603.140716]        ----
      [  603.140716]   lock(&(&fc->bg_lock)->rlock);
      [  603.140718]   lock(&(&fc->bg_lock)->rlock);
      [  603.140719]
      [  603.140719]  *** DEADLOCK ***
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
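The recursive-locking hazard from the submission-context commit, and the fix of ending requests from a worker, can be modelled in a few lines. This is a single-threaded sketch with hypothetical names: a boolean stands in for the non-recursive fc->bg_lock, and an assertion plays the role of the deadlock detector.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_req {
    struct fake_req *next;
    bool ended;
};

struct fake_conn {
    bool bg_lock_held;         /* stands in for a non-recursive spinlock */
    struct fake_req *end_list; /* requests to be ended by the worker */
};

static void bg_lock(struct fake_conn *fc)
{
    assert(!fc->bg_lock_held); /* taking it twice would deadlock */
    fc->bg_lock_held = true;
}

static void bg_unlock(struct fake_conn *fc)
{
    fc->bg_lock_held = false;
}

/* Ending a request needs the lock, so it must not run under it. */
static void end_request(struct fake_conn *fc, struct fake_req *req)
{
    bg_lock(fc);
    req->ended = true;
    bg_unlock(fc);
}

/* Submission path, lock held by the caller: on failure, only park the request. */
static void submit_failed(struct fake_conn *fc, struct fake_req *req)
{
    req->next = fc->end_list;
    fc->end_list = req;
}

/* Worker context, no locks held: end everything that was parked. */
static void end_worker(struct fake_conn *fc)
{
    while (fc->end_list) {
        struct fake_req *req = fc->end_list;

        fc->end_list = req->next;
        end_request(fc, req);
    }
}
```

Calling `end_request()` directly from the submission path while the lock is held would trip the assertion in `bg_lock()`, which is the situation the lockdep warning in the commit message reports.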
    • virtio-fs: Change module name to virtiofs.ko · 5cb5b717
      Committed by Vivek Goyal
      task #28910367
      commit 112e72373d1f60f1e4558d0a7f0de5da39a1224d upstream
      
      We have been calling it virtio_fs and even file name is virtio_fs.c. Module
      name is virtio_fs.ko but when registering file system user is supposed to
      specify filesystem type as "virtiofs".
      
      Masayoshi Mizuma reported that he specified the filesystem type as "virtio_fs"
      and got this warning on the console.
      
        ------------[ cut here ]------------
        request_module fs-virtio_fs succeeded, but still no fs?
        WARNING: CPU: 1 PID: 1234 at fs/filesystems.c:274 get_fs_type+0x12c/0x138
        Modules linked in: ... virtio_fs fuse virtio_net net_failover ...
        CPU: 1 PID: 1234 Comm: mount Not tainted 5.4.0-rc1 #1
      
      So looks like kernel could find the module virtio_fs.ko but could not find
      filesystem type after that.
      
      It probably is better to rename module name to virtiofs.ko so that above
      warning goes away in case user ends up specifying wrong fs name.
      Reported-by: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      (cherry picked from commit 112e72373d1f60f1e4558d0a7f0de5da39a1224d)
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • S
      virtio-fs: add virtiofs filesystem · 917f6dfb
      Stefan Hajnoczi committed
      task #28910367
      commit a62a8ef9d97da23762a588592c8b8eb50a8deb6a upstream
      
      Add a basic file system module for virtio-fs.  This does not yet contain
      shared data support between host and guest or metadata coherency speedups.
      However it is already significantly faster than virtio-9p.
      
      Design Overview
      ===============
      
      With the goal of designing something with better performance and local file
      system semantics, a bunch of ideas were proposed.
      
       - Use fuse protocol (instead of 9p) for communication between guest and
         host.  Guest kernel will be fuse client and a fuse server will run on
         host to serve the requests.
      
       - For data access inside guest, mmap portion of file in QEMU address space
         and guest accesses this memory using dax.  That way guest page cache is
         bypassed and there is only one copy of data (on host).  This will also
         enable mmap(MAP_SHARED) between guests.
      
       - For metadata coherency, there is a shared memory region which contains
         version number associated with metadata and any guest changing metadata
         updates version number and other guests refresh metadata on next access.
         This is yet to be implemented.
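
The metadata coherency idea in the last bullet can be sketched in plain C. This is only an illustration of the version-number scheme as described; the commit states it is not yet implemented, so every name here (`shared_meta_slot`, `cached_meta`, `meta_needs_refresh`, and so on) is hypothetical, and a real shared-memory version would also need atomic accesses and memory barriers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical slot in the shared memory region: one version per inode. */
struct shared_meta_slot {
	uint64_t version;	/* bumped by whichever guest changes metadata */
};

/* Hypothetical guest-side cache entry for the same inode. */
struct cached_meta {
	uint64_t seen_version;	/* version observed at the last fetch */
	uint64_t size;		/* example cached attribute */
};

/* A guest that changes metadata publishes the change by bumping the version. */
static void meta_publish_change(struct shared_meta_slot *slot)
{
	assert(slot);
	slot->version++;
}

/* True when the cached copy is stale and must be refetched from the host. */
static bool meta_needs_refresh(const struct shared_meta_slot *slot,
			       const struct cached_meta *cache)
{
	assert(slot && cache);
	return slot->version != cache->seen_version;
}

/* Record a fresh fetch together with the version it corresponds to. */
static void meta_refresh(const struct shared_meta_slot *slot,
			 struct cached_meta *cache, uint64_t fresh_size)
{
	assert(slot && cache);
	cache->size = fresh_size;
	cache->seen_version = slot->version;
}
```

In the common case the versions match, so the guest answers from its cache without any request to the host.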
      
      How virtio-fs differs from existing approaches
      ==============================================
      
      The unique idea behind virtio-fs is to take advantage of the co-location of
      the virtual machine and hypervisor to avoid communication (vmexits).
      
      DAX allows file contents to be accessed without communication with the
      hypervisor.  The shared memory region for metadata avoids communication in
      the common case where metadata is unchanged.
      
      By replacing expensive communication with cheaper shared memory accesses,
      we expect to achieve better performance than approaches based on network
      file system protocols.  In addition, this also makes it easier to achieve
      local file system semantics (coherency).
      
      These techniques are not applicable to network file system protocols since
      the communications channel is bypassed by taking advantage of shared memory
      on a local machine.  This is why we decided to build virtio-fs rather than
      focus on 9P or NFS.
      
      Caching Modes
      =============
      
      Like virtio-9p, different caching modes are supported which determine the
      coherency level as well.  The “cache=FOO” and “writeback” options control
      the level of coherence between the guest and host filesystems.
      
       - cache=none
         metadata, data and pathname lookup are not cached in guest.  They are
         always fetched from host and any changes are immediately pushed to host.
      
       - cache=always
         metadata, data and pathname lookup are cached in guest and never expire.
      
       - cache=auto
         metadata and pathname lookup cache expires after a configured amount of
         time (default is 1 second).  Data is cached while the file is open
         (close to open consistency).
      
       - writeback/no_writeback
         These options control the writeback strategy.  If writeback is disabled,
         then normal writes will immediately be synchronized with the host fs.
         If writeback is enabled, then writes may be cached in the guest until
         the file is closed or an fsync(2) performed.  This option has no effect
         on mmap-ed writes or writes going through the DAX mechanism.
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      
      (cherry picked from commit a62a8ef9d97da23762a588592c8b8eb50a8deb6a)
      [Liubo: 4.19 lacks fs_context support for parsing mount options, so
      option parsing is changed back to the 4.19 way; we still use -o
      tag=myfs-1 to mount virtiofs.]
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
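
The caching modes listed in this commit message can be read as a small policy table. The sketch below encodes that reading in C; the struct, its field names and `cache_policy_from_mode()` are hypothetical and are not taken from the virtiofs or FUSE sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical policy derived from a cache= option. */
struct cache_policy {
	bool cache_data;	/* file data may be cached in the guest */
	bool cache_metadata;	/* attributes and lookups may be cached */
	bool expires;		/* cached metadata times out */
	unsigned timeout_sec;	/* timeout used when 'expires' is set */
};

/* Returns false for an unknown mode and leaves *out untouched. */
static bool cache_policy_from_mode(const char *mode, struct cache_policy *out)
{
	assert(mode && out);

	if (strcmp(mode, "none") == 0) {
		/* Nothing is cached; everything goes to the host. */
		*out = (struct cache_policy){ false, false, false, 0 };
	} else if (strcmp(mode, "always") == 0) {
		/* Everything is cached and never expires. */
		*out = (struct cache_policy){ true, true, false, 0 };
	} else if (strcmp(mode, "auto") == 0) {
		/* Metadata expires (default 1 second); data is cached
		 * while the file is open. */
		*out = (struct cache_policy){ true, true, true, 1 };
	} else {
		return false;
	}
	return true;
}
```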
    • M
      fuse: delete dentry if timeout is zero · 9ffcf1ac
      Miklos Szeredi committed
      task #28910367
      commit 8fab010644363f8f80194322aa7a81e38c867af3 upstream
      
      Don't hold onto a dentry in the lru list if it has to be looked up again
      at the next access anyway.  Only do this if explicitly enabled, otherwise
      it could result in a performance regression.
      
      A more advanced version of this patch would periodically flush out
      dentries from the lru which have gone stale.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • V
      fuse: Export fuse_dequeue_forget() function · 09db4841
      Vivek Goyal committed
      task #28910367
      commit 4388c5aac4bae5c83a2c66882043942002ba09a2 upstream
      
      Stacked file systems like virtio-fs do not have to manipulate the forget
      list data structures directly.  There is a helper function; use that
      instead.
      
      Rename dequeue_forget() to fuse_dequeue_forget() and export it so that
      stacked filesystems can use it.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • S
      fuse: export fuse_get_unique() · f96b6dd6
      Stefan Hajnoczi committed
      task #28910367
      commit 79d96efffda7597b41968d5d8813b39fc2965f1b upstream
      
      virtio-fs will need unique IDs for FORGET requests from outside
      fs/fuse/dev.c.  Make the symbol visible.
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • V
      fuse: Separate fuse device allocation and installation in fuse_conn · 0213de76
      Vivek Goyal committed
      task #28910367
      commit 0cd1eb9a4160a96e0ec9b93b2e7b489f449bf22d upstream
      
      As of now fuse_dev_alloc() both allocates a fuse device and installs it
      in fuse_conn list. fuse_dev_alloc() can fail if fuse_device allocation
      fails.
      
      virtio-fs needs to initialize multiple fuse devices (one per virtio
      queue). It initializes one fuse device as part of call to
      fuse_fill_super_common() and rest of the devices are allocated and
      installed after that.
      
      But we can't afford to fail after calling fuse_fill_super_common(), as
      we don't have a way to undo all the actions done by fuse_fill_super_common().
      So to avoid failures after the call to fuse_fill_super_common(),
      pre-allocate all fuse devices early and install them into fuse connection
      later.
      
      This patch provides two separate helpers for fuse device allocation and
      fuse device installation in fuse_conn.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
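
The allocate-early, install-later pattern this commit describes can be illustrated in userspace C. This is a minimal sketch of the general technique, not the kernel code: `struct dev`, `struct conn`, `dev_alloc()`, `dev_install()` and `conn_setup()` are hypothetical stand-ins for the fuse device, the fuse connection and the two helpers.

```c
#include <assert.h>
#include <stdlib.h>

struct dev {
	struct dev *next;
	int installed;
};

struct conn {
	struct dev *devices;	/* list of installed devices */
};

/* Phase 1: allocation, the only step that may fail. */
static struct dev *dev_alloc(void)
{
	return calloc(1, sizeof(struct dev));
}

/* Phase 2: installation, which cannot fail. */
static void dev_install(struct conn *c, struct dev *d)
{
	d->next = c->devices;
	c->devices = d;
	d->installed = 1;
}

/* Allocate all n devices first; install only once nothing can fail. */
static int conn_setup(struct conn *c, size_t n)
{
	struct dev **devs;
	size_t i;

	assert(c && n > 0);
	devs = calloc(n, sizeof(*devs));
	if (!devs)
		return -1;

	for (i = 0; i < n; i++) {
		devs[i] = dev_alloc();
		if (!devs[i]) {
			/* Nothing irreversible has happened yet: unwind. */
			while (i--)
				free(devs[i]);
			free(devs);
			return -1;
		}
	}

	/* The step that cannot be undone would run here. */

	for (i = 0; i < n; i++)
		dev_install(c, devs[i]);
	free(devs);
	return 0;
}
```

Every failure path sits before the irreversible step, so no undo logic is needed after it.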
    • S
      fuse: add fuse_iqueue_ops callbacks · 57f16587
      Stefan Hajnoczi committed
      task #28910367
      commit ae3aad77f46fbba56eff7141b2fc49870b60827e upstream
      
      The /dev/fuse device uses fiq->waitq and fasync to signal that requests
      are available.  These mechanisms do not apply to virtio-fs.  This patch
      introduces callbacks so alternative behavior can be used.
      
      Note that queue_interrupt() changes along these lines:
      
        spin_lock(&fiq->waitq.lock);
        wake_up_locked(&fiq->waitq);
      + kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
        spin_unlock(&fiq->waitq.lock);
      - kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
      
      Since queue_request() and queue_forget() also call kill_fasync() inside
      the spinlock this should be safe.
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
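
The idea of replacing a hard-wired wakeup with callbacks can be sketched as a table of function pointers. The code below is a simplified userspace illustration under that assumption; `input_queue`, `input_queue_ops`, `wake_pending` and the two backends are hypothetical names, and the real callbacks also deal with locking, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

struct input_queue;

/* Hypothetical callback table: how to signal that a request is available. */
struct input_queue_ops {
	void (*wake_pending)(struct input_queue *q);
};

struct input_queue {
	const struct input_queue_ops *ops;
	unsigned pending;	/* queued requests */
	unsigned wakeups;	/* counts wait-queue style wakeups */
	unsigned kicks;		/* counts transport notifications */
};

/* Stands in for waking a wait queue and sending SIGIO (/dev/fuse case). */
static void devfuse_wake(struct input_queue *q)
{
	q->wakeups++;
}

/* Stands in for notifying a virtqueue (virtio-fs case). */
static void virtq_kick(struct input_queue *q)
{
	q->kicks++;
}

static const struct input_queue_ops devfuse_ops = { .wake_pending = devfuse_wake };
static const struct input_queue_ops virtq_ops = { .wake_pending = virtq_kick };

/* Common code queues the request and lets the backend decide how to signal. */
static void queue_request(struct input_queue *q)
{
	assert(q && q->ops && q->ops->wake_pending);
	q->pending++;
	q->ops->wake_pending(q);
}
```

The common queuing code stays identical for both transports; only the ops pointer differs.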
    • V
      fuse: Export fuse_send_init_request() · 6769b1fd
      Vivek Goyal committed
      task #28910367
      commit 95a84cdb11c26315a6d34664846f82c438c961a1 upstream
      
      This will be used by virtio-fs to send init request to fuse server after
      initialization of virt queues.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • S
      fuse: export fuse_len_args() · 609c1cf3
      Stefan Hajnoczi committed
      task #28910367
      commit 14d46d7abc3973a47e8eb0eb5eb87ee8d910a505 upstream
      
      virtio-fs will need to query the length of fuse_arg lists.  Make the
      symbol visible.
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • S
      fuse: export fuse_end_request() · 63b1ffab
      Stefan Hajnoczi committed
      task #28910367
      commit 04ec5af0776e9baefed59891f12adbcb5fa71a23 upstream
      
      virtio-fs will need to complete requests from outside fs/fuse/dev.c.
      Make the symbol visible.
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • S
      fuse: extract fuse_fill_super_common() · 0fdc23c2
      Stefan Hajnoczi committed
      task #28910367
      commit 0cc2656cdb0b1f234e6d29378cb061e29d7522bc upstream
      
      fuse_fill_super() includes code to process the fd= option and link the
      struct fuse_dev to the fd's struct file.  In virtio-fs there is no file
      descriptor because /dev/fuse is not used.
      
      This patch extracts fuse_fill_super_common() so that both classic fuse
      and virtio-fs can share the code to initialize a mount.
      
      parse_fuse_opt() is also extracted so that the fuse_fill_super_common()
      caller has access to the mount options.  This allows classic fuse to
      handle the fd= option outside fuse_fill_super_common().
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • D
      mm: convert PG_balloon to PG_offline · 59df23d6
      David Hildenbrand committed
      task #29077503
      commit ca215086b14b89a0e70fc211314944aa6ce50020 upstream

      PG_balloon was introduced to implement page migration/compaction for
      pages inflated in virtio-balloon.  Nowadays, it is only a marker that a
      page is part of virtio-balloon and therefore logically offline.
      We also want to make use of this flag in other balloon drivers - for
      inflated pages or when onlining a section but keeping some pages offline
      (e.g.  used right now by XEN and Hyper-V via set_online_page_callback()).
      
      We are going to expose this flag to dump tools like makedumpfile.  But
      instead of exposing PG_balloon, let's generalize the concept of marking
      pages as logically offline, so it can be reused for other purposes later
      on.
      
      Rename PG_balloon to PG_offline.  This is an indicator that the page is
      logically offline, the content stale and that it should not be touched
      (e.g.  a hypervisor would have to allocate backing storage in order for
      the guest to dump an unused page).  We can then e.g.  exclude such pages
      from dumps.
      
      We replace and reuse KPF_BALLOON (23), as this shouldn't really harm
      (and for now the semantics stay the same).  In following patches, we
      will make use of this bit also in other balloon drivers.  While at it,
      document PGTABLE.
      
      [akpm@linux-foundation.org: fix comment text, per David]
      Link: http://lkml.kernel.org/r/20181119101616.8901-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Pankaj gupta <pagupta@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      (cherry picked from commit ca215086b14b89a0e70fc211314944aa6ce50020)
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
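
Since this commit reuses bit 23 of the kpageflags word for the offline marker, a dump tool could test for it as sketched below. The macro and function names are hypothetical; the sketch assumes only that each page is described by one 64-bit flags word, as /proc/kpageflags provides, and does not read the file itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit number taken from the commit text: KPF_BALLOON (23) is reused. */
#define KPF_OFFLINE_BIT 23

/* True when the 64-bit flags word marks the page as logically offline. */
static bool kpageflags_is_offline(uint64_t flags)
{
	return (flags >> KPF_OFFLINE_BIT) & 1;
}

/* Count how many pages in a buffer of flags words could be skipped. */
static size_t count_offline_pages(const uint64_t *flags, size_t npages)
{
	size_t i, n = 0;

	assert(flags || npages == 0);
	for (i = 0; i < npages; i++)
		if (kpageflags_is_offline(flags[i]))
			n++;
	return n;
}
```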
    • P
      io_uring: fix recvmsg memory leak with buffer selection · fe15220e
      Pavel Begunkov committed
      to #29441901
      
      commit 681fda8d27a66f7e65ff7f2d200d7635e64a8d05 upstream.
      
      io_recvmsg() doesn't free memory allocated for struct io_buffer. This can
      cause a leak when used with automatic buffer selection.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • Y
      alinux: sched: Add cpu_stress to show system-wide task waiting · ab81d2d9
      Yihao Wu committed
      to #28739709
      
      /proc/loadavg reflects, to some extent, the tasks waiting over a period
      of time. But to serve as an SLI it would need better precision and
      quicker response. Furthermore, I/O blocking is not of concern here,
      and bandwidth control is excluded from cpu_stress.
      
      This patch adds a new interface /proc/cpu_stress. It's based on
      task runtime tracking so we don't need to deal with complex state
      transition. And because task runtime tracking is done in most
      scheduler events, the precision is sufficient.
      
      Like loadavg, cpu_stress has 3 average windows too (1, 5, 15 min).
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
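
The three averaging windows can be illustrated with the fixed-point exponential decay that loadavg uses. Whether cpu_stress uses exactly this arithmetic is an assumption; the commit message only says it has the same 1, 5 and 15 minute windows. The constants are the loadavg ones for a 5-second sampling interval, and `stress_avg` and `stress_update()` are hypothetical names.

```c
#include <assert.h>

#define FSHIFT	11			/* bits of fractional precision */
#define FIXED_1	(1UL << FSHIFT)		/* 1.0 in fixed point */

/* loadavg decay factors for 1, 5 and 15 minutes at 5-second samples. */
static const unsigned long decay[3] = { 1884, 2014, 2037 };

struct stress_avg {
	unsigned long avg[3];		/* fixed-point averages per window */
};

/* One decay step: avg = avg * e + sample * (1 - e), in fixed point. */
static unsigned long decay_step(unsigned long avg, unsigned long e,
				unsigned long sample)
{
	return (avg * e + sample * (FIXED_1 - e)) >> FSHIFT;
}

/* Fold one fixed-point sample into all three windows. */
static void stress_update(struct stress_avg *s, unsigned long sample)
{
	int i;

	assert(s);
	for (i = 0; i < 3; i++)
		s->avg[i] = decay_step(s->avg[i], decay[i], sample);
}
```

A shorter window uses a smaller decay factor, so it reacts faster to a new sample than the longer windows do.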
    • Y
      ovl: inode reference leak in ovl_is_inuse true case. · b247d8a6
      youngjun committed
      to #29273482
      
      In the case where ovl_is_inuse() returns true, the trap inode reference is
      not put.  Also add a comment explaining why ovl_is_inuse() is called
      after ovl_setup_trap().
      
      Fixes: 0be0bfd2de9d ("ovl: fix regression caused by overlapping layers detection")
      Cc: <stable@vger.kernel.org> # v4.19+
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: youngjun <her0gyugyu@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Link: https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git/commit/?h=overlayfs-next&id=24f14009b8f1754ec2ae4c168940c01259b0f88a
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>