1. 02 Sep, 2020 40 commits
    • ovl: initialize error in ovl_copy_xattr · 7db5692f
      Yuxuan Shui committed
      to #28557782
      
      commit 520da69d265a91c6536c63851cbb8a53946974f0 upstream.
      
      In ovl_copy_xattr, if all the xattrs to be copied are overlayfs private
      xattrs, the copy loop will terminate without assigning anything to the
      error variable, thus returning an uninitialized value.
      
      If ovl_copy_xattr is called from ovl_clear_empty, this uninitialized error
      value is put into a pointer by ERR_PTR(), causing potential invalid memory
      accesses down the line.
      
      This commit initializes error to 0. This is the correct value because when
      there is no xattr to copy (because all xattrs are private), ovl_copy_xattr
      should succeed.
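
      The pattern can be illustrated with a minimal userspace sketch. This is not
      the kernel code: copy_xattrs_sketch, is_private_sketch, copy_one_sketch and
      the prefix check are hypothetical names chosen for illustration. The loop
      skips every private name, so the return value is only well defined if
      error starts at 0.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the overlayfs private-xattr test. */
static int is_private_sketch(const char *name)
{
    static const char prefix[] = "trusted.overlay.";
    return strncmp(name, prefix, sizeof(prefix) - 1) == 0;
}

/* Hypothetical copy step: always succeeds here and counts the copies. */
static int copy_one_sketch(const char *name, size_t *copied)
{
    (void)name;
    (*copied)++;
    return 0;
}

/*
 * Copy every non-private name. If "error" were left uninitialized, a list
 * holding only private names would return an indeterminate value; starting
 * at 0 makes the "nothing to copy" case a success.
 */
int copy_xattrs_sketch(const char *const *names, size_t n, size_t *copied)
{
    int error = 0;

    for (size_t i = 0; i < n; i++) {
        if (is_private_sketch(names[i]))
            continue;   /* skipped: error is not assigned on this path */
        error = copy_one_sketch(names[i], copied);
        if (error)
            break;
    }
    return error;
}
```

      Calling copy_xattrs_sketch with a list that contains only private names
      returns 0 and copies nothing, which is the behavior the commit describes.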
      
      This bug was discovered with the help of INIT_STACK_ALL and clang.

      Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
      Link: https://bugs.chromium.org/p/chromium/issues/detail?id=1050405
      Fixes: 0956254a ("ovl: don't copy up opaqueness")
      Cc: stable@vger.kernel.org # v4.8
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • xfs: add agf freeblocks verify in xfs_agf_verify · 028b4911
      Zheng Bin committed
      to #28557760
      
      [ Upstream commit d0c7feaf87678371c2c09b3709400be416b2dc62 ]
      
      We recently used a fuzzer (hydra) to test XFS and automatically generated
      tmp.img (XFS v5 format, but some metadata is wrong).
      
      xfs_repair output (just one AG):
      agf_freeblks 0, counted 3224 in ag 0
      agf_longest 536874136, counted 3224 in ag 0
      sb_fdblocks 613, counted 3228
      
      Test as follows:
      mount tmp.img tmpdir
      cp file1M tmpdir
      sync
      
      In 4.19-stable, sync gets stuck. The reason is:
      xfs_mountfs
        xfs_check_summary_counts
          if ((!xfs_sb_version_haslazysbcount(&mp->m_sb) ||
             XFS_LAST_UNMOUNT_WAS_CLEAN(mp)) &&
             !xfs_fs_has_sickness(mp, XFS_SICK_FS_COUNTERS))
              return 0;  --> just returns; in-core sb_fdblocks is still 613
          xfs_initialize_perag_data
      
      cp file1M tmpdir --> ok (writes the file to the page cache)
      sync --> stuck (writes the page cache to disk)
      xfs_map_blocks
        xfs_iomap_write_allocate
          while (count_fsb != 0) {
            nimaps = 0;
            while (nimaps == 0) { --> endless loop
               nimaps = 1;
               xfs_bmapi_write(..., &nimaps) --> nimaps becomes 0 again
      xfs_bmapi_write
        xfs_bmap_alloc
          xfs_bmap_btalloc
            xfs_alloc_vextent
              xfs_alloc_fix_freelist
                xfs_alloc_space_available --> fails (agf_freeblks is 0)
      
      In linux-next, sync does not get stuck, because commit c2b3164320b5 ("xfs:
      use the latest extent at writeback delalloc conversion time") removed
      the above while loop. dmesg shows the following:
      [   55.250114] XFS (loop0): page discard on page ffffea0008bc7380, inode 0x1b0c, offset 0.
      
      Users do not know why this page was discarded. Better solutions are:
      1. Like xfs_repair, make sure sb_fdblocks is equal to the counted value
      (xfs_initialize_perag_data does this, but it is not called at this mount).
      2. Add an AGF verifier; if it fails, users are told to run repair.
      
      This patch uses the second solution.
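
      A sketch of the kind of consistency check such a verifier could perform,
      assuming it compares the free-block count against the longest free extent
      and the AG length. The struct and function names are hypothetical, and the
      exact conditions in the upstream patch may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-core view of the three AGF fields discussed above. */
struct agf_sketch {
    uint32_t length;    /* AG size in blocks */
    uint32_t freeblks;  /* free blocks in the AG */
    uint32_t longest;   /* longest free extent in the AG */
};

/* Return true when the free-space counters are mutually consistent. */
bool agf_counters_ok_sketch(const struct agf_sketch *agf)
{
    /* The longest free extent cannot be larger than the free total. */
    if (agf->freeblks < agf->longest)
        return false;
    /* The free total cannot be larger than the AG itself. */
    if (agf->freeblks > agf->length)
        return false;
    return true;
}
```

      With the values reported by xfs_repair above (agf_freeblks 0, agf_longest
      536874136), the first comparison already rejects the AGF, so the problem
      would surface as a verifier failure instead of a hung sync.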

      Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
      Signed-off-by: Ren Xudong <renxudong1@huawei.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • ext4: fix race between ext4_sync_parent() and rename() · 727bd990
      Eric Biggers committed
      to #28557685
      
      commit 08adf452e628b0e2ce9a01048cfbec52353703d7 upstream.
      
      'igrab(d_inode(dentry->d_parent))' without holding dentry->d_lock is
      broken because without d_lock, d_parent can be concurrently changed due
      to a rename().  Then if the old directory is immediately deleted, old
      d_parent->inode can be NULL.  That causes a NULL dereference in igrab().
      
      To fix this, use dget_parent() to safely grab a reference to the parent
      dentry, which pins the inode.  This also eliminates the need to use
      d_find_any_alias() other than for the initial inode, as we no longer
      throw away the dentry at each step.
      
      This is an extremely hard race to hit, but it is possible.  Adding a
      udelay() in between the reads of ->d_parent and its ->d_inode makes it
      reproducible on a no-journal filesystem using the following program:
      
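          #include <fcntl.h>
          #include <stdio.h>      /* rename() */
          #include <sys/stat.h>   /* mkdir() */
          #include <unistd.h>     /* fork(), write(), close(), rmdir() */
      
          int main(void)
          {
              if (fork()) {
                  /* Parent: keep creating dir1/file with synchronous writes. */
                  for (;;) {
                      mkdir("dir1", 0700);
                      /* O_CREAT requires a mode argument. */
                      int fd = open("dir1/file", O_RDWR|O_CREAT|O_SYNC, 0600);
                      write(fd, "X", 1);
                      close(fd);
                  }
              } else {
                  /* Child: move the file away and delete its old parent. */
                  mkdir("dir2", 0700);
                  for (;;) {
                      rename("dir1/file", "dir2/file");
                      rmdir("dir1");
                  }
              }
          }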
      
      Fixes: d59729f4 ("ext4: fix races in ext4_sync_parent()")
      Cc: stable@vger.kernel.org
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Link: https://lore.kernel.org/r/20200506183140.541194-1-ebiggers@kernel.org
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • ext4: fix EXT_MAX_EXTENT/INDEX to check for zeroed eh_max · d9bf1840
      Harshad Shirwadkar committed
      to #28557685
      
      commit c36a71b4e35ab35340facdd6964a00956b9fef0a upstream.
      
      If eh->eh_max is 0, EXT_MAX_EXTENT/INDEX would evaluate to unsigned
      (-1) resulting in illegal memory accesses. Although there is no
      consistent repro, we see that generic/019 sometimes crashes because of
      this bug.
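
      The wraparound can be shown with a small sketch. The real macros compute
      pointers into the extent header; the two hypothetical helpers below only
      model the "eh_max - 1" slot arithmetic in unsigned form, with and without
      a guard for a zeroed eh_max.

```c
#include <stddef.h>
#include <stdint.h>

/* Unguarded form: with eh_max == 0 the unsigned subtraction wraps to SIZE_MAX. */
size_t max_slot_unguarded_sketch(uint16_t eh_max)
{
    return (size_t)eh_max - 1;
}

/* Guarded form: a zeroed eh_max is handled explicitly instead of wrapping. */
size_t max_slot_guarded_sketch(uint16_t eh_max)
{
    return eh_max ? (size_t)eh_max - 1 : 0;
}
```

      Using the unguarded result as an array offset would point far outside the
      extent block, which matches the illegal memory accesses described above.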
      
      Ran gce-xfstests smoke and verified that there were no regressions.

      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Link: https://lore.kernel.org/r/20200421023959.20879-2-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • ext4: disable dioread_nolock whenever delayed allocation is disabled · 8c6a9862
      Eric Whitney committed
      fix #29455282
      
      commit c8980e1980ccdc2229aa2218d532ddc62e0aabe5 upstream
      
      The patch "ext4: make dioread_nolock the default" (244adf6426ee) causes
      generic/422 to fail when run in kvm-xfstests' ext3conv test case.  This
      applies both the dioread_nolock and nodelalloc mount options, a
      combination not previously tested by kvm-xfstests.  The failure occurs
      because the dioread_nolock code path splits a previously fallocated
      multiblock extent into a series of single block extents when overwriting
      a portion of that extent.  That causes allocation of an extent tree leaf
      node and a reshuffling of extents.  Once writeback is completed, the
      individual extents are recombined into a single extent, the extent is
      moved again, and the leaf node is deleted.  The difference in block
      utilization before and after writeback due to the leaf node triggers the
      failure.
      
      The original reason for this behavior was to avoid ENOSPC when handling
      I/O completions during writeback in the dioread_nolock code paths when
      delayed allocation is disabled.  It may no longer be necessary, because
      code was added in the past to reserve extra space to solve this problem
      when delayed allocation is enabled, and this code may also apply when
      delayed allocation is disabled.  Until this can be verified, don't use
      the dioread_nolock code paths if delayed allocation is disabled.
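
      The resulting policy can be sketched as a predicate over mount options.
      The flag bits and the function name are hypothetical and do not match the
      kernel's EXT4_MOUNT_* values; the kernel's predicate may also test other
      conditions that are omitted here.

```c
#include <stdbool.h>

/* Hypothetical mount-option bits, not the kernel's EXT4_MOUNT_* values. */
#define OPT_DIOREAD_NOLOCK_SKETCH 0x1u
#define OPT_DELALLOC_SKETCH       0x2u

/* Decide whether the dioread_nolock code paths should be used. */
bool should_dioread_nolock_sketch(unsigned int mount_opts)
{
    if (!(mount_opts & OPT_DIOREAD_NOLOCK_SKETCH))
        return false;
    /* Added condition: without delayed allocation, avoid the nolock paths. */
    if (!(mount_opts & OPT_DELALLOC_SKETCH))
        return false;
    return true;
}
```

      With this check, a mount that combines dioread_nolock and nodelalloc falls
      back to the regular code paths.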

      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Link: https://lore.kernel.org/r/20200319150028.24592-1-enwlinux@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • alinux: virtiofs: simplify mount options · d068535c
      Liu Bo committed
      task #28910367
      Rather than explicitly specifying "-o
      default_permissions,allow_other", virtiofs can set some default values
      for them.
      
      With this, we can simply do
      "mount -t virtio_fs atest /mnt/test/ -otag=myfs-1,dax".

      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • alinux: virtio-fs: export fuse_request_free · e6067150
      Liu Bo committed
      task #28910367
      virtio-fs will need to use it from outside fs/fuse/dev.c.
      Make the symbol visible.

      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • fuse: Support RENAME_WHITEOUT flag · ebea99bf
      Vivek Goyal committed
      task #28910367
      commit 519525fa47b5a8155f0b203e49a3a6a2319f75ae upstream
      
      Allow fuse to pass RENAME_WHITEOUT to fuse server.  Overlayfs on top of
      virtiofs uses RENAME_WHITEOUT.
      
      Without this patch renaming a directory in overlayfs (dir is on lower)
      fails with -EINVAL. With this patch it works.
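
      A sketch of the kind of flag check involved, assuming the rename path
      rejects any flag outside an allowed mask. The SK_* constants are local
      copies of the renameat2(2) flag values, and check_rename_flags_sketch with
      its allow_whiteout parameter is a hypothetical helper for illustration.

```c
#include <errno.h>

/* Local copies of the renameat2(2) flag values. */
#define SK_RENAME_NOREPLACE (1u << 0)
#define SK_RENAME_EXCHANGE  (1u << 1)
#define SK_RENAME_WHITEOUT  (1u << 2)

/* Return 0 if every requested flag is allowed, -EINVAL otherwise. */
int check_rename_flags_sketch(unsigned int flags, int allow_whiteout)
{
    unsigned int allowed = SK_RENAME_NOREPLACE | SK_RENAME_EXCHANGE;

    if (allow_whiteout)
        allowed |= SK_RENAME_WHITEOUT;

    return (flags & ~allowed) ? -EINVAL : 0;
}
```

      With allow_whiteout set to 0 the helper models the old behavior, where a
      rename carrying RENAME_WHITEOUT fails with -EINVAL; with it set to 1 the
      flag passes through.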

      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      (cherry picked from commit 519525fa47b5a8155f0b203e49a3a6a2319f75ae)
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • virtiofs: Use completions while waiting for queue to be drained · 88fa38fa
      Vivek Goyal committed
      task #28910367
      commit 724c15a43e2c7ac26e2d07abef99191162498fa9 upstream
      
      While we wait for the queue to finish draining, use completions instead of
      usleep_range(). This is a better way of waiting for an event.

      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      (cherry picked from commit 724c15a43e2c7ac26e2d07abef99191162498fa9)
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • virtiofs: Do not send forget request "struct list_head" element · 2a6ae53e
      Vivek Goyal committed
      task #28910367
      commit 1efcf39eb627573f8d543ea396cf36b0651b1e56 upstream
      
      We are sending the whole virtio_fs_forget struct to the other end over the
      virtqueue. The other end does not need to see elements like "struct list".
      That is an internal detail of the guest kernel. Fix it.
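
      The idea can be sketched by separating the wire-format part of a request
      from the guest-side bookkeeping. All types and names here are hypothetical
      and simplified; the real forget request carries a FUSE header and body.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical list linkage, standing in for the kernel's list_head. */
struct list_node_sketch {
    struct list_node_sketch *next, *prev;
};

/* Wire-format part: the only bytes the other end needs to see. */
struct forget_req_sketch {
    uint64_t nodeid;
    uint64_t nlookup;
};

/* Guest-side item: wraps the wire part with internal bookkeeping. */
struct forget_item_sketch {
    struct forget_req_sketch req;
    struct list_node_sketch list;   /* guest-internal, never sent */
};

/* Select the bytes to place on the queue: only the embedded request. */
const void *forget_payload_sketch(const struct forget_item_sketch *item,
                                  size_t *len)
{
    *len = sizeof(item->req);
    return &item->req;
}
```

      Sending sizeof(item->req) bytes instead of sizeof(*item) keeps the list
      pointers out of the buffer handed to the device.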

      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      (cherry picked from commit 1efcf39eb627573f8d543ea396cf36b0651b1e56)
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • virtiofs: Use a common function to send forget · a6d9f512
      Vivek Goyal committed
      task #28910367
      commit 58ada94f95f71d4f73197ab0e9603dbba6e47fe3 upstream
      
      Currently we are duplicating logic to send forgets at two
      places. Consolidate the code by calling one helper function.
      
      This also uses virtqueue_add_outbuf() instead of
      virtqueue_add_sgs(). The former is simpler to call.

      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      (cherry picked from commit 58ada94f95f71d4f73197ab0e9603dbba6e47fe3)
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • virtiofs: Fix old-style declaration · 8270fcad
      YueHaibing committed
      task #28910367
      commit 00929447f5758c4f64c74d0a4b40a6eb3d9df0e3 upstream
      
      The 'static' keyword is expected to come first in a declaration, and we
      get warnings like this with "make W=1":
      
      fs/fuse/virtio_fs.c:687:1: warning: 'static' is not at beginning of declaration [-Wold-style-declaration]
      fs/fuse/virtio_fs.c:692:1: warning: 'static' is not at beginning of declaration [-Wold-style-declaration]
      fs/fuse/virtio_fs.c:1029:1: warning: 'static' is not at beginning of declaration [-Wold-style-declaration]

      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      (cherry picked from commit 00929447f5758c4f64c74d0a4b40a6eb3d9df0e3)
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • virtiofs: Remove set but not used variable 'fc' · 823286b7
      zhengbin committed
      task #28910367
      commit 80da5a809d193c60d090cbdf4fe316781bc07965 upstream
      
      Fixes gcc '-Wunused-but-set-variable' warning:
      
      fs/fuse/virtio_fs.c: In function virtio_fs_wake_pending_and_unlock:
      fs/fuse/virtio_fs.c:983:20: warning: variable fc set but not used [-Wunused-but-set-variable]
      
      It is not used since commit 7ee1e2e631db ("virtiofs: No need to check
      fpq->connected state")

      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
    • virtiofs: Retry request submission from worker context · 986957da
      Vivek Goyal committed
      task #28910367
      commit a9bfd9dd3417561d06c81de04f6d6c1e0c9b3d44 upstream
      
      If the regular request queue gets full, currently we sleep for a bit and
      retry submission in the submitter's context. This assumes the submitter is
      not holding any spin lock. But this assumption is not true for background
      requests. For background requests, we are called with fc->bg_lock held.
      
      This can lead to deadlock where one thread is trying submission with
      fc->bg_lock held while request completion thread has called
      fuse_request_end() which tries to acquire fc->bg_lock and gets blocked. As
      request completion thread gets blocked, it does not make further progress
      and that means queue does not get empty and submitter can't submit more
      requests.
      
      To solve this issue, retry submission with the help of a worker, instead of
      retrying in submitter's context. We already do this for hiprio/forget
      requests.

      Reported-by: Chirantan Ekbote <chirantan@chromium.org>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      (cherry picked from commit a9bfd9dd3417561d06c81de04f6d6c1e0c9b3d44)
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • virtiofs: Count pending forgets as in_flight forgets · ab040ec8
      Vivek Goyal committed
      task #28910367
      
      commit c17ea009610366146ec409fd6dc277e0f2510b10 upstream
      
      If virtqueue is full, we put forget requests on a list and these forgets
      are dispatched later using a worker. As of now we don't count these forgets
      in fsvq->in_flight variable. This means when queue is being drained, we
      have to have special logic to first drain these pending requests and then
      wait for fsvq->in_flight to go to zero.
      
      By counting pending forgets in fsvq->in_flight, we can get rid of special
      logic and just wait for in_flight to go to zero. Worker thread will kick
      and drain all the forgets anyway, leading in_flight to zero.
      
      I also need similar logic for normal request queue in next patch where I am
      about to defer request submission in the worker context if queue is full.
      
      This simplifies the code a bit.
      
      Also add two helper functions to inc/dec in_flight. The decrement helper
      will later be used to call completion when in_flight reaches zero.
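
      A minimal single-threaded sketch of such helpers. The struct and function
      names are hypothetical, a plain boolean stands in for a completion, and the
      locking the kernel would need around the counter is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-queue state. */
struct vq_sketch {
    unsigned int in_flight;  /* submitted plus pending requests */
    bool drained;            /* stands in for a completion */
};

/* Account for one more request, whether it is queued or still pending. */
void inc_in_flight_sketch(struct vq_sketch *vq)
{
    vq->in_flight++;
    vq->drained = false;
}

/* Drop one request; the last one marks the queue as drained. */
void dec_in_flight_sketch(struct vq_sketch *vq)
{
    assert(vq->in_flight > 0);
    if (--vq->in_flight == 0)
        vq->drained = true;   /* where a completion would be signalled */
}
```

      Because pending and submitted requests share one counter, a drain only has
      to wait for the counter to reach zero.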

      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      (cherry picked from commit c17ea009610366146ec409fd6dc277e0f2510b10)
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • virtiofs: Set FR_SENT flag only after request has been sent · f5fa6847
      Vivek Goyal committed
      task #28910367
      commit 5dbe190f341206a7896f7e40c1e3a36933d812f3 upstream
      
      The FR_SENT flag should be set when the request has been sent successfully
      over the virtqueue. This is used by the interrupt logic to figure out
      whether an interrupt request should be sent or not.
      
      Also add it to the fpq->processing list after sending it successfully.

      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • virtiofs: No need to check fpq->connected state · da960fb2
      Vivek Goyal committed
      task #28910367
      commit 7ee1e2e631dbf0ff0df2a67a1e01ba3c1dce7a46 upstream
      
      In virtiofs we keep per queue connected state in virtio_fs_vq->connected
      and use that to end request if queue is not connected. And virtiofs does
      not even touch fpq->connected state.
      
      We probably need to merge these two at some point of time. For now,
      simplify the code a bit and do not worry about checking state of
      fpq->connected.

      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
    • virtiofs: Do not end request in submission context · c3694064
      Vivek Goyal committed
      task #28910367
      commit 51fecdd2555b3e0e05a78d30093c638d164a32f9 upstream
      
      Submission context can hold some locks which end request code tries to hold
      again and deadlock can occur. For example, fc->bg_lock. If a background
      request is being submitted, it might hold fc->bg_lock and if we could not
      submit request (because device went away) and tried to end request, then
      deadlock happens. During testing, I also got a warning from deadlock
      detection code.
      
      So put requests on a list and end requests from a worker thread.
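
      The deferral can be sketched as two steps: the submission path only records
      the failure on a list, and a separate worker ends the requests later. All
      names are hypothetical, and the locking and workqueue scheduling that the
      kernel would use are omitted.

```c
#include <stddef.h>

/* Hypothetical request with a link for the deferred-end list. */
struct req_sketch {
    int error;
    int ended;
    struct req_sketch *next;
};

/* Hypothetical queue state holding requests that still have to be ended. */
struct vq_end_sketch {
    struct req_sketch *end_head;
};

/* Submission path: record the error and defer; do not end the request here. */
void defer_end_sketch(struct vq_end_sketch *vq, struct req_sketch *req, int error)
{
    req->error = error;
    req->next = vq->end_head;
    vq->end_head = req;
}

/* Worker path: runs without the submitter's locks and ends each request. */
size_t end_worker_sketch(struct vq_end_sketch *vq)
{
    size_t n = 0;

    while (vq->end_head) {
        struct req_sketch *req = vq->end_head;

        vq->end_head = req->next;
        req->next = NULL;
        req->ended = 1;     /* where the real end-request call would go */
        n++;
    }
    return n;
}
```

      The submitter never runs the end-request logic itself, so a lock it holds
      at submission time is not taken a second time on the same stack.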
      
      I got the following warning from the deadlock detector.
      
      [  603.137138] WARNING: possible recursive locking detected
      [  603.137142] --------------------------------------------
      [  603.137144] blogbench/2036 is trying to acquire lock:
      [  603.137149] 00000000f0f51107 (&(&fc->bg_lock)->rlock){+.+.}, at: fuse_request_end+0xdf/0x1c0 [fuse]
      [  603.140701]
      [  603.140701] but task is already holding lock:
      [  603.140703] 00000000f0f51107 (&(&fc->bg_lock)->rlock){+.+.}, at: fuse_simple_background+0x92/0x1d0 [fuse]
      [  603.140713]
      [  603.140713] other info that might help us debug this:
      [  603.140714]  Possible unsafe locking scenario:
      [  603.140714]
      [  603.140715]        CPU0
      [  603.140716]        ----
      [  603.140716]   lock(&(&fc->bg_lock)->rlock);
      [  603.140718]   lock(&(&fc->bg_lock)->rlock);
      [  603.140719]
      [  603.140719]  *** DEADLOCK ***

      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
    • virtio-fs: Change module name to virtiofs.ko · 5cb5b717
      Vivek Goyal committed
      task #28910367
      commit 112e72373d1f60f1e4558d0a7f0de5da39a1224d upstream
      
      We have been calling it virtio_fs and even file name is virtio_fs.c. Module
      name is virtio_fs.ko but when registering file system user is supposed to
      specify filesystem type as "virtiofs".
      
      Masayoshi Mizuma reported that he specified the filesystem type as
      "virtio_fs" and got this warning on the console.
      
        ------------[ cut here ]------------
        request_module fs-virtio_fs succeeded, but still no fs?
        WARNING: CPU: 1 PID: 1234 at fs/filesystems.c:274 get_fs_type+0x12c/0x138
        Modules linked in: ... virtio_fs fuse virtio_net net_failover ...
        CPU: 1 PID: 1234 Comm: mount Not tainted 5.4.0-rc1 #1
      
      So looks like kernel could find the module virtio_fs.ko but could not find
      filesystem type after that.
      
      It probably is better to rename module name to virtiofs.ko so that above
      warning goes away in case user ends up specifying wrong fs name.

      Reported-by: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      (cherry picked from commit 112e72373d1f60f1e4558d0a7f0de5da39a1224d)
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • virtio-fs: add virtiofs filesystem · 917f6dfb
      Stefan Hajnoczi committed
      task #28910367
      commit a62a8ef9d97da23762a588592c8b8eb50a8deb6a upstream
      
      Add a basic file system module for virtio-fs.  This does not yet contain
      shared data support between host and guest or metadata coherency speedups.
      However it is already significantly faster than virtio-9p.
      
      Design Overview
      ===============
      
      With the goal of designing something with better performance and local file
      system semantics, a bunch of ideas were proposed.
      
       - Use fuse protocol (instead of 9p) for communication between guest and
         host.  Guest kernel will be fuse client and a fuse server will run on
         host to serve the requests.
      
       - For data access inside guest, mmap portion of file in QEMU address space
         and guest accesses this memory using dax.  That way guest page cache is
         bypassed and there is only one copy of data (on host).  This will also
         enable mmap(MAP_SHARED) between guests.
      
       - For metadata coherency, there is a shared memory region which contains
         version number associated with metadata and any guest changing metadata
         updates version number and other guests refresh metadata on next access.
         This is yet to be implemented.
      
      How virtio-fs differs from existing approaches
      ==============================================
      
      The unique idea behind virtio-fs is to take advantage of the co-location of
      the virtual machine and hypervisor to avoid communication (vmexits).
      
      DAX allows file contents to be accessed without communication with the
      hypervisor.  The shared memory region for metadata avoids communication in
      the common case where metadata is unchanged.
      
      By replacing expensive communication with cheaper shared memory accesses,
      we expect to achieve better performance than approaches based on network
      file system protocols.  In addition, this also makes it easier to achieve
      local file system semantics (coherency).
      
      These techniques are not applicable to network file system protocols since
      the communications channel is bypassed by taking advantage of shared memory
      on a local machine.  This is why we decided to build virtio-fs rather than
      focus on 9P or NFS.
      
      Caching Modes
      =============
      
      Like virtio-9p, different caching modes are supported which determine the
      coherency level as well.  The “cache=FOO” and “writeback” options control
      the level of coherence between the guest and host filesystems.
      
       - cache=none
         metadata, data and pathname lookup are not cached in guest.  They are
         always fetched from host and any changes are immediately pushed to host.
      
       - cache=always
         metadata, data and pathname lookup are cached in guest and never expire.
      
       - cache=auto
         metadata and pathname lookup cache expires after a configured amount of
         time (default is 1 second).  Data is cached while the file is open
         (close to open consistency).
      
       - writeback/no_writeback
         These options control the writeback strategy.  If writeback is disabled,
         then normal writes will immediately be synchronized with the host fs.
         If writeback is enabled, then writes may be cached in the guest until
         the file is closed or an fsync(2) performed.  This option has no effect
         on mmap-ed writes or writes going through the DAX mechanism.
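
      The three cache= modes can be modelled with a small sketch. The enum, the
      parser and the validity helper are hypothetical names for illustration;
      they only encode the expiry rules listed above (none never reuses cached
      metadata, always never expires it, auto expires it after a timeout).

```c
#include <string.h>

enum cache_mode_sketch {
    CACHE_NONE_SKETCH,
    CACHE_AUTO_SKETCH,
    CACHE_ALWAYS_SKETCH,
    CACHE_INVALID_SKETCH
};

/* Map the value of a cache= option to a mode. */
enum cache_mode_sketch parse_cache_mode_sketch(const char *value)
{
    if (strcmp(value, "none") == 0)
        return CACHE_NONE_SKETCH;
    if (strcmp(value, "auto") == 0)
        return CACHE_AUTO_SKETCH;
    if (strcmp(value, "always") == 0)
        return CACHE_ALWAYS_SKETCH;
    return CACHE_INVALID_SKETCH;
}

/* May metadata cached age_ms milliseconds ago still be used? */
int metadata_cache_valid_sketch(enum cache_mode_sketch mode,
                                unsigned int age_ms, unsigned int timeout_ms)
{
    switch (mode) {
    case CACHE_ALWAYS_SKETCH:
        return 1;                       /* never expires */
    case CACHE_AUTO_SKETCH:
        return age_ms < timeout_ms;     /* expires after the timeout */
    default:
        return 0;                       /* none or invalid: always refetch */
    }
}
```

      With the default timeout of 1 second mentioned above, auto mode would
      reuse an entry that is 500 ms old and refetch one that is 1500 ms old.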

      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      
      (cherry picked from commit a62a8ef9d97da23762a588592c8b8eb50a8deb6a)
      [Liubo: given that 4.19 lacks fs_context support for parsing mount
      options, here I just change it back to the 4.19 way, so we still use -o
      tag=myfs-1 to get a virtiofs mount.]
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • fuse: delete dentry if timeout is zero · 9ffcf1ac
      Miklos Szeredi committed
      task #28910367
      commit 8fab010644363f8f80194322aa7a81e38c867af3 upstream
      
      Don't hold onto a dentry in the lru list if it needs to be looked up again
      anyway at the next access.  Only do this if explicitly enabled, otherwise
      it could result in a performance regression.
      
      More advanced version of this patch would periodically flush out dentries
      from the lru which have gone stale.

      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • fuse: Export fuse_dequeue_forget() function · 09db4841
      Vivek Goyal committed
      task #28910367
      commit 4388c5aac4bae5c83a2c66882043942002ba09a2 upstream
      
      Stacked file systems like virtio-fs do not have to play directly with the
      forget list data structures. There is a helper function; use that instead.
      
      Rename dequeue_forget() to fuse_dequeue_forget() and export it so that
      stacked filesystems can use it.

      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • fuse: export fuse_get_unique() · f96b6dd6
      Stefan Hajnoczi committed
      task #28910367
      commit 79d96efffda7597b41968d5d8813b39fc2965f1b upstream
      
      virtio-fs will need unique IDs for FORGET requests from outside
      fs/fuse/dev.c.  Make the symbol visible.

      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • fuse: Separate fuse device allocation and installation in fuse_conn · 0213de76
      Vivek Goyal committed
      task #28910367
      commit 0cd1eb9a4160a96e0ec9b93b2e7b489f449bf22d upstream
      
      As of now fuse_dev_alloc() both allocates a fuse device and installs it
      in fuse_conn list. fuse_dev_alloc() can fail if fuse_device allocation
      fails.
      
      virtio-fs needs to initialize multiple fuse devices (one per virtio
      queue). It initializes one fuse device as part of call to
      fuse_fill_super_common() and rest of the devices are allocated and
      installed after that.
      
      But we can't afford to fail after calling fuse_fill_super_common(), as
      we don't have a way to undo all the actions done by fuse_fill_super_common().
      So to avoid failures after the call to fuse_fill_super_common(),
      pre-allocate all fuse devices early and install them into fuse connection
      later.
      
      This patch provides two separate helpers for fuse device allocation and
      fuse device installation in fuse_conn.
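
      The two-phase pattern can be sketched in userspace as follows. The types
      and function names are hypothetical: the first helper does all the
      allocation and may fail, while the second only publishes already
      allocated devices and has no failure path.

```c
#include <stdlib.h>

/* Hypothetical device and connection types. */
struct dev_sketch {
    int installed;
};

struct conn_sketch {
    struct dev_sketch **devs;
    size_t ndevs;
};

/* Release a (possibly partial) array of devices. */
void free_devs_sketch(struct dev_sketch **devs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        free(devs[i]);
    free(devs);
}

/* Phase 1: allocate everything up front; this is the only step that can fail. */
struct dev_sketch **prealloc_devs_sketch(size_t n)
{
    struct dev_sketch **devs = calloc(n, sizeof(*devs));

    if (!devs)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        devs[i] = calloc(1, sizeof(*devs[i]));
        if (!devs[i]) {
            free_devs_sketch(devs, i);  /* undo the partial allocation */
            return NULL;
        }
    }
    return devs;
}

/* Phase 2: install into the connection; no allocation, so no failure path. */
void install_devs_sketch(struct conn_sketch *conn, struct dev_sketch **devs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        devs[i]->installed = 1;
    conn->devs = devs;
    conn->ndevs = n;
}
```

      A caller would run the first phase before the step that cannot be undone
      and the second phase after it, so that no error can occur once the point
      of no return has been passed.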

      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • fuse: add fuse_iqueue_ops callbacks · 57f16587
      Stefan Hajnoczi committed
      task #28910367
      commit ae3aad77f46fbba56eff7141b2fc49870b60827e upstream
      
      The /dev/fuse device uses fiq->waitq and fasync to signal that requests
      are available.  These mechanisms do not apply to virtio-fs.  This patch
      introduces callbacks so alternative behavior can be used.
      
      Note that queue_interrupt() changes along these lines:
      
        spin_lock(&fiq->waitq.lock);
        wake_up_locked(&fiq->waitq);
      + kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
        spin_unlock(&fiq->waitq.lock);
      - kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
      
      Since queue_request() and queue_forget() also call kill_fasync() inside
      the spinlock this should be safe.
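
      The callback approach can be sketched with a table of function pointers
      attached to the queue. Everything below is a hypothetical, simplified
      model: the names are illustrative and the counters stand in for the real
      wakeup and virtqueue kick operations.

```c
#include <stddef.h>

struct iqueue_sketch;

/* Hypothetical callback table: how to signal that a request is available. */
struct iqueue_ops_sketch {
    void (*wake_pending)(struct iqueue_sketch *q);
};

struct iqueue_sketch {
    const struct iqueue_ops_sketch *ops;
    unsigned int pending;
    unsigned int dev_wakeups;   /* stands in for waitq/fasync signalling */
    unsigned int virtio_kicks;  /* stands in for handing work to a virtqueue */
};

static void dev_wake_sketch(struct iqueue_sketch *q)
{
    q->dev_wakeups++;
}

static void virtio_wake_sketch(struct iqueue_sketch *q)
{
    q->virtio_kicks++;
}

const struct iqueue_ops_sketch dev_ops_sketch = { .wake_pending = dev_wake_sketch };
const struct iqueue_ops_sketch virtio_ops_sketch = { .wake_pending = virtio_wake_sketch };

/* Common queueing code: transport-specific behavior goes through the ops. */
void queue_request_sketch(struct iqueue_sketch *q)
{
    q->pending++;
    q->ops->wake_pending(q);
}
```

      The common queueing code stays the same for both transports; only the ops
      table chosen when the queue is set up differs.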

      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • fuse: Export fuse_send_init_request() · 6769b1fd
      Vivek Goyal committed
      task #28910367
      commit 95a84cdb11c26315a6d34664846f82c438c961a1 upstream
      
      This will be used by virtio-fs to send init request to fuse server after
      initialization of virt queues.

      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • fuse: export fuse_len_args() · 609c1cf3
      Stefan Hajnoczi committed
      task #28910367
      commit 14d46d7abc3973a47e8eb0eb5eb87ee8d910a505 upstream
      
      virtio-fs will need to query the length of fuse_arg lists.  Make the
      symbol visible.
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      609c1cf3
    • S
      fuse: export fuse_end_request() · 63b1ffab
      Stefan Hajnoczi committed
      task #28910367
      commit 04ec5af0776e9baefed59891f12adbcb5fa71a23 upstream
      
      virtio-fs will need to complete requests from outside fs/fuse/dev.c.
      Make the symbol visible.
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      63b1ffab
    • S
      fuse: extract fuse_fill_super_common() · 0fdc23c2
      Stefan Hajnoczi committed
      task #28910367
      commit 0cc2656cdb0b1f234e6d29378cb061e29d7522bc upstream
      
      fuse_fill_super() includes code to process the fd= option and link the
      struct fuse_dev to the fd's struct file.  In virtio-fs there is no file
      descriptor because /dev/fuse is not used.
      
      This patch extracts fuse_fill_super_common() so that both classic fuse
      and virtio-fs can share the code to initialize a mount.
      
      parse_fuse_opt() is also extracted so that the fuse_fill_super_common()
      caller has access to the mount options.  This allows classic fuse to
      handle the fd= option outside fuse_fill_super_common().
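
      The split described above can be sketched in userspace as follows. The
      types and function names (struct mount_opts, struct mount_state,
      classic_fill_super, virtio_fill_super) are hypothetical and far simpler
      than the kernel code; the point is only that option parsing happens in
      the caller, so the fd= handling stays out of the shared part:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical parsed mount options; only the classic path uses fd. */
struct mount_opts {
	int fd;
	bool fd_present;
	unsigned int max_read;
};

struct mount_state {
	unsigned int max_read;
	int dev_fd;   /* -1 when no /dev/fuse file descriptor is involved */
	bool ready;
};

/* Shared part: initialise whatever does not depend on the transport. */
static int fill_super_common(struct mount_state *m, const struct mount_opts *o)
{
	assert(m != NULL && o != NULL);
	m->max_read = o->max_read;
	m->dev_fd = -1;
	m->ready = true;
	return 0;
}

/* Classic path: check fd= itself, then reuse the common code. */
static int classic_fill_super(struct mount_state *m, const struct mount_opts *o)
{
	int err;

	if (!o->fd_present)
		return -1;
	err = fill_super_common(m, o);
	if (err)
		return err;
	m->dev_fd = o->fd;
	return 0;
}

/* virtio-fs-like path: no fd= handling at all. */
static int virtio_fill_super(struct mount_state *m, const struct mount_opts *o)
{
	return fill_super_common(m, o);
}
```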
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      0fdc23c2
    • D
      mm: convert PG_balloon to PG_offline · 59df23d6
      David Hildenbrand committed
      task #29077503
      commit ca215086b14b89a0e70fc211314944aa6ce50020 upstream

      PG_balloon was introduced to implement page migration/compaction for
      pages inflated in virtio-balloon.  Nowadays, it is only a marker that a
      page is part of virtio-balloon and therefore logically offline.

      We also want to make use of this flag in other balloon drivers - for
      inflated pages or when onlining a section but keeping some pages offline
      (e.g.  used right now by XEN and Hyper-V via set_online_page_callback()).
      
      We are going to expose this flag to dump tools like makedumpfile.  But
      instead of exposing PG_balloon, let's generalize the concept of marking
      pages as logically offline, so it can be reused for other purposes later
      on.
      
      Rename PG_balloon to PG_offline.  This is an indicator that the page is
      logically offline, the content stale and that it should not be touched
      (e.g.  a hypervisor would have to allocate backing storage in order for
      the guest to dump an unused page).  We can then e.g.  exclude such pages
      from dumps.
      
      We replace and reuse KPF_BALLOON (23), as this shouldn't really harm
      (and for now the semantics stay the same).  In following patches, we
      will make use of this bit also in other balloon drivers.  While at it,
      document PGTABLE.
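
      To illustrate how a consumer might act on the flag, here is a small
      userspace sketch that tests bit 23 (the reused KPF_BALLOON bit) in a
      64-bit page-flags word and skips such pages. The helper names are
      hypothetical, and how a real dump tool obtains page flags is not
      described in this commit:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 23 was KPF_BALLOON; the patch reuses it for the offline marker. */
#define KPF_OFFLINE_BIT 23

/* True if the flags word marks the page as logically offline. */
static bool page_is_offline(uint64_t kpageflags)
{
	return ((kpageflags >> KPF_OFFLINE_BIT) & 1u) != 0;
}

/* Count the pages a dump tool would keep after skipping offline ones. */
static size_t count_dumpable(const uint64_t *flags, size_t npages)
{
	size_t i, n = 0;

	assert(npages == 0 || flags != NULL);
	for (i = 0; i < npages; i++)
		if (!page_is_offline(flags[i]))
			n++;
	return n;
}
```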
      
      [akpm@linux-foundation.org: fix comment text, per David]
      Link: http://lkml.kernel.org/r/20181119101616.8901-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Pankaj gupta <pagupta@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Christian Hansen <chansen3@cisco.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Freche <jfreche@vmware.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Lianbo Jiang <lijiang@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xavier Deguillard <xdeguillard@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

      (cherry picked from commit ca215086b14b89a0e70fc211314944aa6ce50020)
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      59df23d6
    • P
      io_uring: fix recvmsg memory leak with buffer selection · fe15220e
      Pavel Begunkov committed
      to #29441901
      
      commit 681fda8d27a66f7e65ff7f2d200d7635e64a8d05 upstream.
      
      io_recvmsg() doesn't free memory allocated for struct io_buffer. This can
      cause a leak when used with automatic buffer selection.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      fe15220e
    • Y
      alinux: sched: Add cpu_stress to show system-wide task waiting · ab81d2d9
      Yihao Wu committed
      to #28739709
      
      /proc/loadavg can reflect the number of waiting tasks over a period of
      time to some extent. But serving as an SLI requires better precision and
      quicker response. Furthermore, I/O blocking is not considered here,
      and bandwidth control is excluded from cpu_stress.
      
      This patch adds a new interface, /proc/cpu_stress. It is based on
      task runtime tracking, so we don't need to deal with complex state
      transitions. And because task runtime tracking is done in most
      scheduler events, the precision is sufficient.
      
      Like loadavg, cpu_stress has three averaging windows (1, 5, and 15 min).
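
      The commit does not say how the three windows are computed. As a point
      of comparison only, the sketch below shows loadavg-style fixed-point
      exponential decay, using the constants the kernel uses for loadavg with
      a 5-second sampling interval. Whether cpu_stress uses the same formula
      is an assumption, and decay_avg() is a hypothetical name:

```c
#include <assert.h>

#define FSHIFT   11               /* bits of fixed-point precision */
#define FIXED_1  (1UL << FSHIFT)  /* 1.0 in fixed point */
#define EXP_1    1884UL           /* 1/exp(5sec/1min) in fixed point */
#define EXP_5    2014UL           /* 1/exp(5sec/5min) in fixed point */
#define EXP_15   2037UL           /* 1/exp(5sec/15min) in fixed point */

/* Blend the previous average with a new sample; both are fixed point. */
static unsigned long decay_avg(unsigned long avg, unsigned long exp,
			       unsigned long sample)
{
	assert(exp <= FIXED_1);
	return (avg * exp + sample * (FIXED_1 - exp)) >> FSHIFT;
}
```

      With this scheme the 1-minute window reacts fastest to a new sample and
      the 15-minute window slowest, which is the behaviour users expect from
      the three loadavg columns.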
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      ab81d2d9
    • Y
      ovl: inode reference leak in ovl_is_inuse true case. · b247d8a6
      youngjun committed
      to #29273482
      
      When ovl_is_inuse() returns true, the trap inode reference is not put.
      Put it, and add a comment explaining the ordering of ovl_is_inuse()
      after ovl_setup_trap().
      
      Fixes: 0be0bfd2de9d ("ovl: fix regression caused by overlapping layers detection")
      Cc: <stable@vger.kernel.org> # v4.19+
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: youngjun <her0gyugyu@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Link: https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git/commit/?h=overlayfs-next&id=24f14009b8f1754ec2ae4c168940c01259b0f88a
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      b247d8a6
    • P
      io_uring: fix not initialised work->flags · 109831ff
      Pavel Begunkov committed
      to #29276773
      
      commit 16d598030a37853a7a6b4384cad19c9c0af2f021 upstream.
      
      59960b9deb535 ("io_uring: fix lazy work init") tried to fix missing
      io_req_init_async(), but left out work.flags and hash. Do it earlier.
      
      Fixes: 7cdaf587de7c ("io_uring: avoid whole io_wq_work copy for requests completed inline")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      109831ff
    • P
      io_uring: fix missing msg_name assignment · 589ee219
      Pavel Begunkov committed
      to #29276773
      
      commit dd821e0c95a64b5923a0c57f07d3f7563553e756 upstream.
      
      Make sure to set msg.msg_name for the async portion of send/recvmsg,
      as the header copy will copy to/from it.
      
      Cc: stable@vger.kernel.org # v5.5+
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      589ee219
    • J
      io_uring: account user memory freed when exit has been queued · 8c9ebb73
      Jens Axboe committed
      to #29276773
      
      commit 309fc03a3284af62eb6082fb60327045a1dabf57 upstream.
      
      We currently account the memory after the exit work has been run, but
      that leaves a gap between the time a process closes its ring and the
      time the memory is accounted as freed. If the memlocked ulimit is
      borderline, then that can introduce spurious setup errors returning
      -ENOMEM because the free work hasn't been run yet.
      
      Account this as freed when we close the ring, so as not to expose a tiny
      gap where setting up a new ring can fail.
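
      The accounting gap can be modelled in a few lines of userspace C. The
      structures and functions below (struct user_acct, struct ring,
      ring_setup, ring_close, ring_exit_work) are hypothetical simplifications
      and not the kernel code; they only show why uncharging at close time
      lets an immediate re-setup succeed:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical per-user memlock accounting, in pages. */
struct user_acct {
	unsigned long locked;  /* pages currently charged */
	unsigned long limit;   /* memlock ulimit */
};

struct ring {
	struct user_acct *user;
	unsigned long pages;
	int accounted;         /* non-zero while the pages are charged */
};

/* Charge the pages, or fail with -ENOMEM if the limit would be exceeded. */
static int ring_setup(struct ring *r, struct user_acct *u, unsigned long pages)
{
	if (u->locked + pages > u->limit)
		return -ENOMEM;
	u->locked += pages;
	r->user = u;
	r->pages = pages;
	r->accounted = 1;
	return 0;
}

/* Uncharge at close time, before the deferred teardown has run. */
static void ring_close(struct ring *r)
{
	if (r->accounted) {
		r->user->locked -= r->pages;
		r->accounted = 0;
	}
}

/* Deferred teardown: the pages must already be uncharged by now. */
static void ring_exit_work(struct ring *r)
{
	assert(!r->accounted);
	r->pages = 0;
}
```

      If the uncharge lived in ring_exit_work() instead, a ring_setup() call
      issued between close and the deferred work would see the old charge and
      fail even though the ring is already gone from the caller's view.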
      
      Fixes: 85faa7b8346e ("io_uring: punt final io_ring_ctx wait-and-free to workqueue")
      Cc: stable@vger.kernel.org # v5.7
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      8c9ebb73
    • Y
      io_uring: fix memleak in io_sqe_files_register() · c1f5f815
      Yang Yingliang committed
      to #29276773
      
      commit 667e57da358f61b6966e12e925a69e42d912e8bb upstream.
      
      I got a memleak report when doing some fuzz testing:
      
      BUG: memory leak
      unreferenced object 0x607eeac06e78 (size 8):
        comm "test", pid 295, jiffies 4294735835 (age 31.745s)
        hex dump (first 8 bytes):
          00 00 00 00 00 00 00 00                          ........
        backtrace:
          [<00000000932632e6>] percpu_ref_init+0x2a/0x1b0
          [<0000000092ddb796>] __io_uring_register+0x111d/0x22a0
          [<00000000eadd6c77>] __x64_sys_io_uring_register+0x17b/0x480
          [<00000000591b89a6>] do_syscall_64+0x56/0xa0
          [<00000000864a281d>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Call percpu_ref_exit() on error path to avoid
      refcount memleak.
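
      The general shape of such a fix is an unwind label on the error path.
      The sketch below is a userspace model with hypothetical names
      (struct ref, ref_init, ref_exit, files_register); it is not the
      io_uring code, but it shows where the missing cleanup call belongs:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcount that owns an allocation. */
struct ref {
	int *counter;
};

static int live_refs;  /* outstanding ref_init() allocations */

static int ref_init(struct ref *r)
{
	r->counter = calloc(1, sizeof(*r->counter));
	if (!r->counter)
		return -ENOMEM;
	live_refs++;
	return 0;
}

static void ref_exit(struct ref *r)
{
	free(r->counter);
	r->counter = NULL;
	live_refs--;
}

/* fail_later simulates a failure in a step that follows ref_init(). */
static int files_register(struct ref *r, int fail_later)
{
	int err;

	assert(r != NULL);
	err = ref_init(r);
	if (err)
		return err;
	if (fail_later) {
		err = -EINVAL;
		goto out_ref;
	}
	return 0;

out_ref:
	ref_exit(r);  /* without this, the allocation above would leak */
	return err;
}
```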
      
      Fixes: 05f3fb3c5397 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
      Cc: stable@vger.kernel.org
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      c1f5f815
    • Y
      io_uring: fix memleak in __io_sqe_files_update() · 4b392775
      Yang Yingliang committed
      to #29276773
      
      commit f3bd9dae3708a0ff6b067e766073ffeb853301f9 upstream.
      
      I got a memleak report when doing some fuzz testing:
      
      BUG: memory leak
      unreferenced object 0xffff888113e02300 (size 488):
      comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s)
      hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      a0 a4 ce 19 81 88 ff ff 60 ce 09 0d 81 88 ff ff ........`.......
      backtrace:
      [<00000000129a84ec>] kmem_cache_zalloc include/linux/slab.h:659 [inline]
      [<00000000129a84ec>] __alloc_file+0x25/0x310 fs/file_table.c:101
      [<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151
      [<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193
      [<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233
      [<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
      [<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74
      [<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720
      [<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
      [<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      BUG: memory leak
      unreferenced object 0xffff8881152dd5e0 (size 16):
      comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s)
      hex dump (first 16 bytes):
      01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................
      backtrace:
      [<0000000074caa794>] kmem_cache_zalloc include/linux/slab.h:659 [inline]
      [<0000000074caa794>] lsm_file_alloc security/security.c:567 [inline]
      [<0000000074caa794>] security_file_alloc+0x32/0x160 security/security.c:1440
      [<00000000c6745ea3>] __alloc_file+0xba/0x310 fs/file_table.c:106
      [<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151
      [<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193
      [<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233
      [<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
      [<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74
      [<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720
      [<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
      [<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      If io_sqe_file_register() fails, we need to put the file obtained by
      fget() to avoid the memleak.
      
      Fixes: c3a31e605620 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
      Cc: stable@vger.kernel.org
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      4b392775
    • D
      vfs, afs, ext4: Make the inode hash table RCU searchable · dc136109
      David Howells committed
      task #29263287
      
      commit 3f19b2ab97a97b413c24b66c67ae16daa4f56c35 upstream
      
      Make the inode hash table RCU searchable so that searches that want to
      access or modify an inode without taking a ref on that inode can do so
      without taking the inode hash table lock.
      
      The main thing this requires is some RCU annotation on the list
      manipulation operations.  Inodes are already freed by RCU in most cases.
      
      Users of this interface must take care as the inode may be still under
      construction or may be being torn down around them.
      
      There are at least three instances where this can be of use:
      
       (1) Testing whether the inode number iunique() is going to return is
           currently unique (the iunique_lock is still held).
      
       (2) Ext4 date stamp updating.
      
       (3) AFS callback breaking.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      cc: linux-ext4@vger.kernel.org
      cc: linux-afs@lists.infradead.org
      [jeffle: resolve collision in afs_break_one_callback since code base change]
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      dc136109
    • X
      io_uring: export cq overflow status to userspace · c7f865a4
      Xiaoguang Wang committed
      to #29233603
      
      commit 6d5f904904608a9cd32854d7d0a4dd65b27f9935 upstream
      
      Applications that do not want to use io_uring_enter() to reap and
      handle cqes may rely completely on liburing's io_uring_peek_cqe().
      But if the cq ring has overflowed, io_uring_peek_cqe() is currently
      not aware of the overflow and won't enter the kernel to flush cqes.
      The test program below reveals this bug:
      
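      #include <assert.h>
      #include <errno.h>
      #include <stdio.h>
      #include <string.h>
      #include <liburing.h>
      
      static void test_cq_overflow(struct io_uring *ring)
      {
              struct io_uring_cqe *cqe;
              struct io_uring_sqe *sqe;
              int issued = 0;
              int ret = 0;
      
              /* Submit NOPs without reaping until the kernel returns -EBUSY. */
              do {
                      sqe = io_uring_get_sqe(ring);
                      if (!sqe) {
                              fprintf(stderr, "get sqe failed\n");
                              break;
                      }
                      /* Make the request type explicit instead of relying on a zeroed sqe. */
                      io_uring_prep_nop(sqe);
                      ret = io_uring_submit(ring);
                      if (ret <= 0) {
                              if (ret != -EBUSY)
                                      fprintf(stderr, "sqe submit failed: %d\n", ret);
                              break;
                      }
                      issued++;
              } while (ret > 0);
              assert(ret == -EBUSY);
      
              printf("issued requests: %d\n", issued);
      
              /* Reap with peek only; overflowed cqes are never flushed here. */
              while (issued) {
                      ret = io_uring_peek_cqe(ring, &cqe);
                      if (ret) {
                              if (ret != -EAGAIN) {
                                      /* liburing returns -errno, so negate it for strerror(). */
                                      fprintf(stderr, "peek completion failed: %s\n",
                                              strerror(-ret));
                                      break;
                              }
                              printf("left requests: %d\n", issued);
                              continue;
                      }
                      io_uring_cqe_seen(ring, cqe);
                      issued--;
                      printf("left requests: %d\n", issued);
              }
      }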
      
      int main(int argc, char *argv[])
      {
              int ret;
              struct io_uring ring;
      
              ret = io_uring_queue_init(16, &ring, 0);
              if (ret) {
                      fprintf(stderr, "ring setup failed: %d\n", ret);
                      return 1;
              }
      
              test_cq_overflow(&ring);
              return 0;
      }
      
      To fix this issue, export the cq overflow status to userspace by adding a
      new IORING_SQ_CQ_OVERFLOW flag, so that helper functions in liburing, such
      as io_uring_peek_cqe(), can be aware of the cq overflow and flush
      accordingly.
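
      A userspace helper could consult the new flag roughly as sketched
      below. The two flag values match the io_uring uapi header, but
      struct ring_view and peek_needs_flush() are hypothetical, and
      liburing's real logic may differ:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* SQ ring flag bits as defined in the io_uring uapi header. */
#define IORING_SQ_NEED_WAKEUP  (1U << 0)
#define IORING_SQ_CQ_OVERFLOW  (1U << 1)

/* Hypothetical, simplified view of what a peek helper can observe. */
struct ring_view {
	unsigned int sq_flags;  /* flags word written by the kernel */
	unsigned int cq_head;
	unsigned int cq_tail;
};

static bool cq_has_entries(const struct ring_view *r)
{
	return r->cq_head != r->cq_tail;
}

/* Should a peek enter the kernel so that overflowed cqes get flushed? */
static bool peek_needs_flush(const struct ring_view *r)
{
	assert(r != NULL);
	return !cq_has_entries(r) &&
	       (r->sq_flags & IORING_SQ_CQ_OVERFLOW) != 0;
}
```

      When the cq ring looks empty but the overflow bit is set, returning
      -EAGAIN straight away would hide completions that the kernel is still
      holding, which is the situation the test program runs into.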
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      c7f865a4