  1. 06 Sep, 2021 · 2 commits
    • fuse: remove unused arg in fuse_write_file_get() · a9667ac8
      Committed by Miklos Szeredi
      The struct fuse_conn argument is not used and can be removed.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: wait for writepages in syncfs · 660585b5
      Committed by Miklos Szeredi
      In case of fuse the MM subsystem doesn't guarantee that page writeback
      completes by the time ->sync_fs() is called.  This is because fuse
      completes page writeback immediately to prevent DoS of memory reclaim by
      the userspace file server.
      
      This means that fuse itself must ensure that writes are synced before
      sending the SYNCFS request to the server.
      
      Introduce sync buckets, that hold a counter for the number of outstanding
      write requests.  On syncfs replace the current bucket with a new one and
      wait until the old bucket's counter goes down to zero.
      
      It is possible to have multiple syncfs calls in parallel, in which case
      there could be more than one waited-on bucket.  Descendant buckets must
      not complete until the parent completes.  Add a count to the child (new)
      bucket until the (parent) old bucket completes.
      
      Use RCU protection to dereference the current bucket and to wake up an
      emptied bucket.  Use fc->lock to protect against parallel assignments to
      the current bucket.
      
      This leaves just the counter to be a possible scalability issue.  The
      fc->num_waiting counter has a similar issue, so both should be addressed at
      the same time.
      Reported-by: Amir Goldstein <amir73il@gmail.com>
      Fixes: 2d82ab25 ("virtiofs: propagate sync() to file server")
      Cc: <stable@vger.kernel.org> # v5.14
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
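The sync-bucket scheme described in this commit can be modelled in a few lines of single-threaded C. This is only a sketch under stated assumptions: the names `bucket`, `conn`, `write_begin`, `write_end` and `syncfs_begin` are hypothetical, plain integers stand in for the kernel's atomics, and there is no RCU, locking or waitqueue. In this model an emptied parent releases its hold on the child directly, where the real code would have the waiter do so after waking up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a sync bucket. */
struct bucket {
    int count;            /* 1 initial ref + outstanding writes (+1 while a parent holds it) */
    struct bucket *child; /* bucket that replaced this one, if any */
};

struct conn {
    struct bucket *curr;  /* bucket that new writes are charged to */
};

static void bucket_init(struct bucket *b)
{
    b->count = 1;
    b->child = NULL;
}

/* Drop one count; an emptied bucket releases its hold on its child. */
static void bucket_dec(struct bucket *b)
{
    assert(b->count > 0);
    if (--b->count == 0 && b->child)
        bucket_dec(b->child);
}

/* A write request starts: charge it to the current bucket. */
static struct bucket *write_begin(struct conn *c)
{
    c->curr->count++;
    return c->curr;
}

/* A write request completes: uncharge the bucket it started in. */
static void write_end(struct bucket *b)
{
    bucket_dec(b);
}

/*
 * syncfs: swap in a fresh bucket and return the old one to wait on,
 * or NULL if there are no outstanding writes.
 */
static struct bucket *syncfs_begin(struct conn *c, struct bucket *fresh)
{
    struct bucket *old = c->curr;

    if (old->count == 1)
        return NULL;

    bucket_init(fresh);
    fresh->count++;   /* child must not complete before its parent */
    old->child = fresh;
    c->curr = fresh;
    bucket_dec(old);  /* drop the old bucket's initial reference */
    return old;
}

/* The caller of syncfs would sleep until this becomes true. */
static bool bucket_drained(const struct bucket *b)
{
    return b->count == 0;
}
```

With two overlapping syncfs calls, the second (child) bucket stays non-empty until the first (parent) bucket drains, even if all writes charged to the child have already finished.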
  2. 31 Aug, 2021 · 1 commit
    • fuse: flush extending writes · 59bda8ec
      Committed by Miklos Szeredi
      Callers of fuse_writeback_range() assume that the file is ready for
      modification by the server in the supplied byte range after the call
      returns.
      
      If there's a write that extends the file beyond the end of the supplied
      range, then the file needs to be extended to at least the end of the range,
      but currently that's not done.
      
      There are at least two cases where this can cause problems:
      
       - copy_file_range() will return short count if the file is not extended
         up to end of the source range.
      
       - FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE will not extend the file,
         hence the region may not be fully allocated.
      
      Fix by flushing writes from the start of the range up to the end of the
      file.  This could be optimized if the writes are non-extending, etc, but
      it's probably not worth the trouble.
      
      Fixes: a2bc9236 ("fuse: fix copy_file_range() in the writeback case")
      Fixes: 6b1bdb56 ("fuse: allow fallocate(FALLOC_FL_ZERO_RANGE)")
      Cc: <stable@vger.kernel.org>  # v5.2
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  3. 19 Aug, 2021 · 1 commit
  4. 18 Aug, 2021 · 1 commit
    • fuse: truncate pagecache on atomic_o_trunc · 76224355
      Committed by Miklos Szeredi
      fuse_finish_open() will be called with FUSE_NOWRITE in case of atomic
      O_TRUNC.  This can deadlock with fuse_wait_on_page_writeback() in
      fuse_launder_page() triggered by invalidate_inode_pages2().
      
      Fix by replacing invalidate_inode_pages2() in fuse_finish_open() with a
      truncate_pagecache() call.  This makes sense regardless of FOPEN_KEEP_CACHE
      or fc->writeback_cache, so do it unconditionally.
      Reported-by: Xie Yongji <xieyongji@bytedance.com>
      Reported-and-tested-by: syzbot+bea44a5189836d956894@syzkaller.appspotmail.com
      Fixes: e4648309 ("fuse: truncate pending writes on O_TRUNC")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  5. 05 Aug, 2021 · 2 commits
    • fuse: allow sharing existing sb · 5d5b74aa
      Committed by Miklos Szeredi
      Make it possible to create a new mount from an already working server.
      
      Here's a detailed description of the problem from Jakob:
      
        "The background for this question is occasional problems we see with our
         fuse filesystem [1] and mount namespaces. On a usual client, we have
         system-wide, autofs managed mountpoints. When a new mount namespace is
         created (which can be done unprivileged in combination with user
         namespaces), it can happen that a mountpoint is used inside the new
         namespace but idle in the root mount namespace. So autofs unmounts the
         parent, system-wide mountpoint. But the fuse module stays active and
         still serves mountpoint in the child mount namespace. Because the fuse
         daemon also blocks other system wide resources corresponding to the
         mountpoint, this situation effectively prevents new mounts until the
         child mount namespaces closes.
      
         [1] https://github.com/cvmfs/cvmfs"
      Reported-by: Jakob Blomer <jblomer@cern.ch>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: move fget() to fuse_get_tree() · 62dd1fc8
      Committed by Miklos Szeredi
      Affected call chains:
      
      fuse_get_tree
         -> get_tree_(bdev|nodev)
            -> fuse_fill_super
      
      Needed for following patch.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  6. 04 Aug, 2021 · 3 commits
    • fuse: move option checking into fuse_fill_super() · badc7414
      Committed by Miklos Szeredi
      Checking whether the "fd=", "rootmode=", "user_id=" and "group_id=" mount
      options are present can be moved from fuse_get_tree() into
      fuse_fill_super() where the values of the options are consumed.
      
      This relaxes semantics of reusing a fuse blockdev mount using the device
      name.  Before this patch the presence of these options was enforced but
      their values ignored; after this patch these options are completely
      ignored in this case.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: name fs_context consistently · 84c21507
      Committed by Miklos Szeredi
      Naming convention under fs/fuse/:
      
      	struct fuse_conn *fc;
      	struct fs_context *fsc;
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: fix use after free in fuse_read_interrupt() · e1e71c16
      Committed by Miklos Szeredi
      There is a potential race between fuse_read_interrupt() and
      fuse_request_end().
      
      TASK1
        in fuse_read_interrupt(): delete req->intr_entry (while holding
        fiq->lock)
      
      TASK2
        in fuse_request_end(): req->intr_entry is empty -> skip fiq->lock
        wake up TASK3
      
      TASK3
        request is freed
      
      TASK1
        in fuse_read_interrupt(): dereference req->in.h.unique ***BAM***
      
      
      Fix by always grabbing fiq->lock if the request was ever interrupted
      (FR_INTERRUPTED set) thereby serializing with concurrent
      fuse_read_interrupt() calls.
      
      FR_INTERRUPTED is set before the request is queued on fiq->interrupts.
      Dequeuing the request is done with list_del_init() but FR_INTERRUPTED is not
      cleared in this case.
      Reported-by: lijiazi <lijiazi@xiaomi.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
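The difference between the old and the new locking decision in fuse_request_end() can be illustrated with a tiny state model. This is a sketch, not the kernel code: `req_model` and the helper names are hypothetical, and two booleans stand in for the FR_INTERRUPTED flag and for "req->intr_entry is on a list".

```c
#include <stdbool.h>

/* Hypothetical model of the two pieces of request state involved. */
struct req_model {
    bool interrupted;  /* models FR_INTERRUPTED: set once, never cleared */
    bool on_intr_list; /* models !list_empty(&req->intr_entry) */
};

/* The request is interrupted and queued on the interrupts list. */
static void queue_interrupt(struct req_model *r)
{
    r->interrupted = true;
    r->on_intr_list = true;
}

/* The reader dequeues the entry while holding the queue lock. */
static void read_interrupt_dequeue(struct req_model *r)
{
    r->on_intr_list = false;
}

/* Old check: skip the lock once the entry has been dequeued. */
static bool end_takes_lock_old(const struct req_model *r)
{
    return r->on_intr_list;
}

/* New check: take the lock if the request was ever interrupted. */
static bool end_takes_lock_new(const struct req_model *r)
{
    return r->interrupted;
}
```

After `queue_interrupt()` followed by `read_interrupt_dequeue()`, the old check no longer takes the lock, so the ending task is not serialized with a reader that is still using the request. The new check still takes it.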
  7. 13 Jul, 2021 · 1 commit
  8. 08 Jul, 2021 · 1 commit
  9. 30 Jun, 2021 · 2 commits
  10. 22 Jun, 2021 · 11 commits
    • virtiofs: Fix spelling mistakes · c4e0cd4e
      Committed by Zheng Yongjun
      Fix some spelling mistakes in comments:
      refernce  ==> reference
      happnes  ==> happens
      threhold  ==> threshold
      splitted  ==> split
      mached  ==> matched
      Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: use DIV_ROUND_UP helper macro for calculations · 6c88632b
      Committed by Wu Bo
      Replace open coded divisor calculations with the DIV_ROUND_UP kernel macro
      for better readability.
      Signed-off-by: Wu Bo <wubo40@huawei.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
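For readers unfamiliar with the macro, the following userspace sketch contrasts an open-coded round-up division with the helper. The macro body below mirrors the usual kernel definition; the two function names are hypothetical and exist only for the comparison.

```c
/* Same shape as the kernel helper: divide, rounding up. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Open-coded form: the intent is easy to miss at a glance. */
static unsigned int pages_open_coded(unsigned int bytes, unsigned int page_size)
{
    return (bytes + page_size - 1) / page_size;
}

/* Same computation with the helper macro. */
static unsigned int pages_with_macro(unsigned int bytes, unsigned int page_size)
{
    return DIV_ROUND_UP(bytes, page_size);
}
```

Both functions return the number of `page_size` units needed to hold `bytes`, so 4097 bytes with a 4096-byte page size need 2 pages.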
    • fuse: fix illegal access to inode with reused nodeid · 15db1683
      Committed by Amir Goldstein
      The server responds to LOOKUP and other ops (READDIRPLUS/CREATE/MKNOD/...)
      with outarg containing nodeid and generation.
      
      If a fuse inode is found in inode cache with the same nodeid but different
      generation, the existing fuse inode should be unhashed and marked "bad" and
      a new inode with the new generation should be hashed instead.
      
      This can happen, for example, with a passthrough fuse filesystem that returns
      the real filesystem ino/generation on lookup and where real inode numbers
      can get recycled due to real files being unlinked not via the fuse
      passthrough filesystem.
      
      With current code, this situation will not be detected and an old fuse
      dentry that used to point to an older generation real inode, can be used to
      access a completely new inode, which should be accessed only via the new
      dentry.
      
      Note that because the FORGET message carries the nodeid w/o generation, the
      server should wait to get FORGET counts for the nlookup counts of the old
      and reused inodes combined, before it can free the resources associated to
      that nodeid.
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
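The lookup rule described in this commit (same nodeid, different generation: retire the old inode and hash a new one) can be sketched with a toy inode cache. Everything here is hypothetical: the fixed-size array, the `inode_model` fields and the function names are illustrative stand-ins, not the kernel's inode hash.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 8 /* hypothetical fixed-size cache for the sketch */

struct inode_model {
    uint64_t nodeid;
    uint64_t generation;
    bool hashed; /* reachable through cache lookups */
    bool bad;    /* marked bad: further use should fail */
    bool in_use; /* slot is occupied (old dentries may still point here) */
};

struct icache {
    struct inode_model slots[CACHE_SLOTS];
};

static void icache_init(struct icache *c)
{
    memset(c, 0, sizeof(*c));
}

/* Find the hashed inode for a nodeid, if any. */
static struct inode_model *icache_find(struct icache *c, uint64_t nodeid)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        struct inode_model *in = &c->slots[i];

        if (in->in_use && in->hashed && in->nodeid == nodeid)
            return in;
    }
    return NULL;
}

/* Look up (nodeid, generation); retire a stale inode with the same nodeid. */
static struct inode_model *icache_get(struct icache *c, uint64_t nodeid,
                                      uint64_t generation)
{
    struct inode_model *in = icache_find(c, nodeid);

    if (in) {
        if (in->generation == generation)
            return in;
        in->bad = true;     /* old generation must no longer be usable */
        in->hashed = false; /* and must not be found by later lookups */
    }
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        in = &c->slots[i];
        if (!in->in_use) {
            in->nodeid = nodeid;
            in->generation = generation;
            in->hashed = true;
            in->bad = false;
            in->in_use = true;
            return in;
        }
    }
    return NULL; /* cache full */
}
```

An old dentry holding a pointer to the retired inode sees `bad` set, while new lookups for the same nodeid only ever reach the inode with the new generation.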
    • fuse: allow fallocate(FALLOC_FL_ZERO_RANGE) · 6b1bdb56
      Committed by Richard W.M. Jones
      The current fuse module filters out fallocate(FALLOC_FL_ZERO_RANGE)
      returning -EOPNOTSUPP.  libnbd's nbdfuse would like to translate
      FALLOC_FL_ZERO_RANGE requests into the NBD command
      NBD_CMD_WRITE_ZEROES which allows NBD servers that support it to do
      zeroing efficiently.
      
      This commit treats this flag exactly like FALLOC_FL_PUNCH_HOLE.
      
      A way to test this, requiring fuse >= 3, nbdkit >= 1.8 and the latest
      nbdfuse from https://gitlab.com/nbdkit/libnbd/-/tree/master/fuse is to
      create a file containing some data and "mirror" it to a fuse file:
      
        $ dd if=/dev/urandom of=disk.img bs=1M count=1
        $ nbdkit file disk.img
        $ touch mirror.img
        $ nbdfuse mirror.img nbd://localhost &
      
      (mirror.img -> nbdfuse -> NBD over loopback -> nbdkit -> disk.img)
      
      You can then run commands such as:
      
        $ fallocate -z -o 1024 -l 1024 mirror.img
      
      and check that the content of the original file ("disk.img") stays
      synchronized.  To show NBD commands, export LIBNBD_DEBUG=1 before
      running nbdfuse.  To clean up:
      
        $ fusermount3 -u mirror.img
        $ killall nbdkit
      Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
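The mode filtering this commit relaxes can be approximated as a simple mask check. The sketch below assumes the filter is a bitmask test on the fallocate mode and uses the flag constants from the Linux UAPI header; the two function names are hypothetical and model the behaviour before and after the change.

```c
#include <errno.h>
#include <linux/falloc.h>

/* Before: anything other than KEEP_SIZE and PUNCH_HOLE is filtered out. */
static int fallocate_mode_check_old(int mode)
{
    if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
        return -EOPNOTSUPP;
    return 0;
}

/* After: ZERO_RANGE is let through and treated like PUNCH_HOLE. */
static int fallocate_mode_check_new(int mode)
{
    if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
                 FALLOC_FL_ZERO_RANGE))
        return -EOPNOTSUPP;
    return 0;
}
```

Other modes, such as FALLOC_FL_COLLAPSE_RANGE, are still rejected by both versions of the check in this model.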
    • fuse: Make fuse_fill_super_submount() static · 1b539917
      Committed by Greg Kurz
      This function used to be called from fuse_dentry_automount(). This code
      was moved to fuse_get_tree_submount() in the same file since then. It
      is unlikely there will ever be another user. No need to be extern in
      this case.
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: Max Reitz <mreitz@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: Switch to fc_mount() for submounts · 29e0e4df
      Committed by Greg Kurz
      fc_mount() already handles the vfs_get_tree(), sb->s_umount
      unlocking and vfs_create_mount() sequence. Using it greatly
      simplifies fuse_dentry_automount().
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: Max Reitz <mreitz@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: Call vfs_get_tree() for submounts · 266eb3f2
      Committed by Greg Kurz
      We recently fixed an infinite loop by setting the SB_BORN flag on
      submounts along with the write barrier needed by super_cache_count().
      This is the job of vfs_get_tree() and FUSE shouldn't have to care
      about the barrier at all.
      
      Split out some code from fuse_dentry_automount() to the dedicated
      fuse_get_tree_submount() handler for submounts and call vfs_get_tree().
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: Max Reitz <mreitz@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: add dedicated filesystem context ops for submounts · fe0a7bd8
      Committed by Greg Kurz
      The creation of a submount is open-coded in fuse_dentry_automount().
      This brings a lot of complexity and we recently had to fix bugs
      because we weren't setting SB_BORN or because we were unlocking
      sb->s_umount before sb was fully configured. Most of these could
      have been avoided by using the mount API instead of open-coding.
      
      Basically, this means coming up with a proper ->get_tree()
      implementation for submounts and call vfs_get_tree(), or better
      fc_mount().
      
      The creation of the superblock for submounts is quite different from
      the root mount. Especially, it doesn't require to allocate a FUSE
      filesystem context, nor to parse parameters.
      
      Introduce a dedicated context ops for submounts to make this clear.
      This is just a placeholder for now, fuse_get_tree_submount() will
      be populated in a subsequent patch.
      
      Only visible change is that we stop allocating/freeing a useless FUSE
      filesystem context with submounts.
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: Max Reitz <mreitz@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • virtiofs: propagate sync() to file server · 2d82ab25
      Committed by Greg Kurz
      Even if POSIX doesn't mandate it, Linux users legitimately expect sync() to
      flush all data and metadata to physical storage when it is located on the
      same system.  This isn't happening with virtiofs though: sync() inside the
      guest returns right away even though data still needs to be flushed from
      the host page cache.
      
      This is easily demonstrated by doing the following in the guest:
      
      $ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync
      5120+0 records in
      5120+0 records out
      5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s
      sync()                                  = 0 <0.024068>
      
      and start the following in the host when the 'dd' command completes
      in the guest:
      
      $ strace -T -e fsync /usr/bin/sync virtiofs/foo
      fsync(3)                                = 0 <10.371640>
      
      There are no good reasons not to honor the expected behavior of sync()
      actually: it gives an unrealistic impression that virtiofs is super fast
      and that data has safely landed on HW, which isn't the case obviously.
      
      Implement a ->sync_fs() superblock operation that sends a new FUSE_SYNCFS
      request type for this purpose.  Provision a 64-bit placeholder for possible
      future extensions.  Since the file server cannot handle the wait == 0 case,
      we skip it to avoid a gratuitous roundtrip.  Note that this is
      per-superblock: a FUSE_SYNCFS is sent for the root mount and for each
      submount.
      
      Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for FUSE_SYNCFS in
      the file server is treated as permanent success.  This ensures
      compatibility with older file servers: the client will get the current
      behavior of sync() not being propagated to the file server.
      
      Note that such an operation allows the file server to DoS sync().  Since a
      typical FUSE file server is an untrusted piece of software running in
      userspace, this is disabled by default.  Only enable it with virtiofs for
      now since virtiofsd is supposedly trusted by the guest kernel.
      Reported-by: Robert Krawitz <rlk@redhat.com>
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
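The "lack of support is treated as permanent success" rule can be sketched as follows. This is a model under assumptions: `conn_model`, `model_sync_fs` and the function-pointer transport are hypothetical, and -ENOSYS is assumed to be how the server signals an unimplemented request.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical connection state for the sketch. */
struct conn_model {
    bool sync_fs;              /* is it still worth sending the request? */
    int (*send_syncfs)(void);  /* hypothetical transport: 0 or -errno */
    unsigned int requests_sent;
};

static int model_sync_fs(struct conn_model *fc, int wait)
{
    int err;

    /* The wait == 0 case is skipped to avoid a gratuitous roundtrip. */
    if (!wait || !fc->sync_fs)
        return 0;

    fc->requests_sent++;
    err = fc->send_syncfs();
    if (err == -ENOSYS) {
        /* Old server: remember that, and report success from now on. */
        fc->sync_fs = false;
        err = 0;
    }
    return err;
}

/* A server that predates the request type. */
static int server_without_syncfs(void)
{
    return -ENOSYS;
}
```

Against `server_without_syncfs`, the first waiting call sends one request and returns 0; every later call returns 0 without contacting the server again.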
    • fuse: reject internal errno · 49221cf8
      Committed by Miklos Szeredi
      Don't allow userspace to report errors that could be kernel-internal.
      Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
      Fixes: 334f485d ("[PATCH] FUSE - device functions")
      Cc: <stable@vger.kernel.org> # v2.6.14
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
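One way to picture the check is a range test on the error value in the reply header. The sketch below is an assumption-laden illustration: it supposes that kernel-internal error codes begin at 512 (the value of ERESTARTSYS in the kernel's internal errno header) and that a valid reply carries either 0 or a negative errno below that threshold. The constant and function names are hypothetical.

```c
#include <stdbool.h>

/* Assumed start of the kernel-internal errno range (ERESTARTSYS). */
#define MODEL_FIRST_INTERNAL_ERRNO 512

/* A reply may carry 0 (success) or a negative, non-internal errno. */
static bool reply_error_acceptable(int error)
{
    return error <= 0 && error > -MODEL_FIRST_INTERNAL_ERRNO;
}
```

Under these assumptions, -2 (ENOENT) is accepted, while -512 and any positive value are rejected.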
    • fuse: check connected before queueing on fpq->io · 80ef0867
      Committed by Miklos Szeredi
      A request could end up on the fpq->io list after fuse_abort_conn() has
      reset fpq->connected and aborted requests on that list:
      
      Thread-1			  Thread-2
      ========			  ========
      ->fuse_simple_request()           ->shutdown
        ->__fuse_request_send()
          ->queue_request()		->fuse_abort_conn()
      ->fuse_dev_do_read()                ->acquire(fpq->lock)
        ->wait_for(fpq->lock) 	  ->set err to all req's in fpq->io
      				  ->release(fpq->lock)
        ->acquire(fpq->lock)
        ->add req to fpq->io
      
      After the userspace copy is done the request will be ended, but
      req->out.h.error will remain uninitialized.  Also the copy might block
      despite being already aborted.
      
      Fix both issues by not allowing the request to be queued on the fpq->io
      list after fuse_abort_conn() has processed this list.
      Reported-by: Pradeep P V K <pragalla@codeaurora.org>
      Fixes: fd22d62e ("fuse: no fc->lock for iqueue parts")
      Cc: <stable@vger.kernel.org> # v4.2
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  11. 19 Jun, 2021 · 1 commit
    • fuse: ignore PG_workingset after stealing · b89ecd60
      Committed by Miklos Szeredi
      Fix the "fuse: trying to steal weird page" warning.
      
      Description from Johannes Weiner:
      
        "Think of it as similar to PG_active. It's just another usage/heat
         indicator of file and anon pages on the reclaim LRU that, unlike
         PG_active, persists across deactivation and even reclaim (we store it in
         the page cache / swapper cache tree until the page refaults).
      
         So if fuse accepts pages that can legally have PG_active set,
         PG_workingset is fine too."
      Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
      Fixes: 1899ad18 ("mm: workingset: tell cache transitions from workingset thrashing")
      Cc: <stable@vger.kernel.org> # v4.20
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  12. 10 Jun, 2021 · 1 commit
  13. 09 Jun, 2021 · 3 commits
    • fuse: Fix infinite loop in sget_fc() · e4a9ccdd
      Committed by Greg Kurz
      We don't set the SB_BORN flag on submounts. This is wrong as these
      superblocks are then considered as partially constructed or dying
      in the rest of the code and can break some assumptions.
      
      One such case is when you have a virtiofs filesystem with submounts
      and you try to mount it again : virtio_fs_get_tree() tries to obtain
      a superblock with sget_fc(). The logic in sget_fc() is to loop until
      it has either found an existing matching superblock with SB_BORN set
      or to create a brand new one. It is assumed that a superblock without
      SB_BORN is transient and the loop is restarted. Forgetting to set
      SB_BORN on submounts hence causes sget_fc() to retry forever.
      
      Setting SB_BORN requires special care, i.e. a write barrier for
      super_cache_count() which can check SB_BORN without taking any lock.
      We should call vfs_get_tree() to deal with that but this requires
      to have a proper ->get_tree() implementation for submounts, which
      is a bigger piece of work. Go for a simple bug fix in the meantime.
      
      Fixes: bf109c64 ("fuse: implement crossmounts")
      Cc: stable@vger.kernel.org # v5.10+
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: Max Reitz <mreitz@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: Fix crash if superblock of submount gets killed early · e3a43f2a
      Committed by Greg Kurz
      As soon as fuse_dentry_automount() does up_write(&sb->s_umount), the
      superblock can theoretically be killed. If this happens before the
      submount was added to the &fc->mounts list, fuse_mount_remove() later
      crashes in list_del_init() because it assumes the submount to be
      already there.
      
      Add the submount before dropping sb->s_umount to fix the inconsistency.
      It is okay to nest fc->killsb under sb->s_umount, we already do this
      on the ->kill_sb() path.
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Fixes: bf109c64 ("fuse: implement crossmounts")
      Cc: stable@vger.kernel.org # v5.10+
      Reviewed-by: Max Reitz <mreitz@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: Fix crash in fuse_dentry_automount() error path · d92d88f0
      Committed by Greg Kurz
      If fuse_fill_super_submount() returns an error, the error path
      triggers a crash:
      
      [   26.206673] BUG: kernel NULL pointer dereference, address: 0000000000000000
      [...]
      [   26.226362] RIP: 0010:__list_del_entry_valid+0x25/0x90
      [...]
      [   26.247938] Call Trace:
      [   26.248300]  fuse_mount_remove+0x2c/0x70 [fuse]
      [   26.248892]  virtio_kill_sb+0x22/0x160 [virtiofs]
      [   26.249487]  deactivate_locked_super+0x36/0xa0
      [   26.250077]  fuse_dentry_automount+0x178/0x1a0 [fuse]
      
      The crash happens because fuse_mount_remove() assumes that the FUSE
      mount was already added to the list under the FUSE connection, but this
      is only done after fuse_fill_super_submount() has returned success.
      
      This means that until fuse_fill_super_submount() has returned success,
      the FUSE mount isn't actually owned by the superblock. We should thus
      reclaim ownership by clearing sb->s_fs_info, which will skip the call
      to fuse_mount_remove(), and perform rollback, like virtio_fs_get_tree()
      already does for the root sb.
      
      Fixes: bf109c64 ("fuse: implement crossmounts")
      Cc: stable@vger.kernel.org # v5.10+
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: Max Reitz <mreitz@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  14. 03 Jun, 2021 · 1 commit
  15. 16 Apr, 2021 · 1 commit
  16. 14 Apr, 2021 · 8 commits
    • cuse: simplify refcount · 3c9c1433
      Committed by Miklos Szeredi
      Put extra reference early in cuse_channel_open().
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • cuse: prevent clone · 8217673d
      Committed by Miklos Szeredi
      For cloned connections cuse_channel_release() will be called more than
      once, resulting in use after free.
      
      Prevent device cloning for CUSE, which does not make sense at this point,
      and is highly unlikely to be used in real life.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • virtiofs: fix userns · 0a7419c6
      Committed by Miklos Szeredi
      get_user_ns() is done twice (once in virtio_fs_get_tree() and once in
      fuse_conn_init()), resulting in a reference leak.
      
      Also looks better to use fsc->user_ns (which *should* be the
      current_user_ns() at this point).
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • virtiofs: remove useless function · 07595bfa
      Committed by Jiapeng Chong
      Fix the following clang warning:
      
      fs/fuse/virtio_fs.c:130:35: warning: unused function 'vq_to_fpq'
      [-Wunused-function].
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • virtiofs: split requests that exceed virtqueue size · a7f0d7aa
      Committed by Connor Kuehl
      If an incoming FUSE request can't fit on the virtqueue, the request is
      placed onto a workqueue so a worker can try to resubmit it later where
      there will (hopefully) be space for it next time.
      
      This is fine for requests that aren't larger than a virtqueue's maximum
      capacity.  However, if a request's size exceeds the maximum capacity of the
      virtqueue (even if the virtqueue is empty), it will be doomed to a life of
      being placed on the workqueue, removed, found not to fit, and placed
      on the workqueue yet again.
      
      Furthermore, from section 2.6.5.3.1 (Driver Requirements: Indirect
      Descriptors) of the virtio spec:
      
        "A driver MUST NOT create a descriptor chain longer than the Queue
        Size of the device."
      
      To fix this, limit the number of pages FUSE will use for an overall
      request.  This way, each request can realistically fit on the virtqueue
      when it is decomposed into a scatter-gather list and avoid violating section
      2.6.5.3.1 of the virtio spec.
      Signed-off-by: Connor Kuehl <ckuehl@redhat.com>
      Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
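The page limit described in this commit amounts to clamping the per-request page count so that pages plus fixed per-request descriptors never exceed the virtqueue size. The sketch below is illustrative only: the function names are hypothetical, and the overhead of 4 descriptors per request is an assumed value chosen for the example, not a figure taken from the commit message.

```c
#include <assert.h>

/* Assumed number of non-page descriptors each request needs. */
#define MODEL_HEADER_OVERHEAD 4u

/* Clamp max_pages so a full request still fits on the virtqueue. */
static unsigned int limit_max_pages(unsigned int max_pages,
                                    unsigned int virtqueue_size)
{
    assert(virtqueue_size > MODEL_HEADER_OVERHEAD);

    unsigned int room = virtqueue_size - MODEL_HEADER_OVERHEAD;

    return max_pages < room ? max_pages : room;
}

/* Descriptors a request with this many pages occupies in the model. */
static unsigned int descriptors_needed(unsigned int pages)
{
    return pages + MODEL_HEADER_OVERHEAD;
}
```

With a 128-entry virtqueue, a configured limit of 256 pages is reduced to 124, so the largest request occupies exactly 128 descriptors and can always be submitted once the queue is empty.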
    • virtiofs: fix memory leak in virtio_fs_probe() · c79c5e01
      Committed by Luis Henriques
      When accidentally passing the same tag twice to qemu, kmemleak ended up
      reporting a memory leak in virtiofs.  Also, looking at the log I saw the
      following error (that's when I realised the duplicated tag):
      
        virtiofs: probe of virtio5 failed with error -17
      
      Here's the kmemleak log for reference:
      
      unreferenced object 0xffff888103d47800 (size 1024):
        comm "systemd-udevd", pid 118, jiffies 4294893780 (age 18.340s)
        hex dump (first 32 bytes):
          00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
          ff ff ff ff ff ff ff ff 80 90 02 a0 ff ff ff ff  ................
        backtrace:
          [<000000000ebb87c1>] virtio_fs_probe+0x171/0x7ae [virtiofs]
          [<00000000f8aca419>] virtio_dev_probe+0x15f/0x210
          [<000000004d6baf3c>] really_probe+0xea/0x430
          [<00000000a6ceeac8>] device_driver_attach+0xa8/0xb0
          [<00000000196f47a7>] __driver_attach+0x98/0x140
          [<000000000b20601d>] bus_for_each_dev+0x7b/0xc0
          [<00000000399c7b7f>] bus_add_driver+0x11b/0x1f0
          [<0000000032b09ba7>] driver_register+0x8f/0xe0
          [<00000000cdd55998>] 0xffffffffa002c013
          [<000000000ea196a2>] do_one_initcall+0x64/0x2e0
          [<0000000008f727ce>] do_init_module+0x5c/0x260
          [<000000003cdedab6>] __do_sys_finit_module+0xb5/0x120
          [<00000000ad2f48c6>] do_syscall_64+0x33/0x40
          [<00000000809526b5>] entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Luis Henriques <lhenriques@suse.de>
      Fixes: a62a8ef9 ("virtio-fs: add virtiofs filesystem")
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: invalidate attrs when page writeback completes · 3466958b
      Committed by Vivek Goyal
      In fuse when a direct/write-through write happens we invalidate attrs
      because that might have updated mtime/ctime on server and cached
      mtime/ctime will be stale.
      
      What about the page writeback path?  It looks like we don't invalidate attrs
      there.  To be consistent, invalidate attrs in the writeback path as well.  The
      only exception is when writeback_cache is enabled.  In that case we trust the
      local mtime/ctime and there is no need to invalidate attrs.
      
      Recently users started experiencing failures of xfstests generic/080,
      generic/215 and generic/614 on virtiofs.  This happened only with the newer
      "stat" utility and not with the older one.  This patch fixes the issue.
      
      So what's the root cause of the issue?  Here is a detailed explanation.
      
      generic/080 test does mmap write to a file, closes the file and then checks
      if mtime has been updated or not.  When file is closed, it leads to
      flushing of dirty pages (and that should update mtime/ctime on server).
      But we did not explicitly invalidate attrs after writeback finished.  Still,
      generic/080 has passed so far because we invalidated atime in
      fuse_readpages_end().  This is called in the fuse_readahead() path and always
      seems to trigger before the mmaped write.
      
      So after the mmaped write, when lstat() is called, it sees that at least one
      of the fields being asked for is invalid (atime).  That results in a GETATTR
      being sent to the server, mtime/ctime also get updated, and the test
      passes.
      
      But newer /usr/bin/stat seems to have moved to using the statx() syscall now
      (instead of using lstat()).  And statx() allows it to query only ctime or
      mtime (and not the rest of the basic stat fields).  That means when querying
      for mtime, fuse_update_get_attr() sees that mtime is not invalid (only
      atime is invalid).  So it does not generate a new GETATTR and fills stat
      with the cached mtime/ctime.  And that means the updated mtime is not seen by
      xfstests and the tests start failing.
      
      Invalidating attrs after writeback completion should solve this problem in
      a generic manner.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
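The lstat() versus statx() behaviour explained in this commit comes down to intersecting a "requested fields" mask with an "invalid fields" mask. The following sketch models that with hypothetical names and bit values (`MODEL_ATTR_*`, `attr_cache`, `needs_getattr`); it is not the kernel's data structure, only an illustration of the reasoning.

```c
#include <stdbool.h>

/* Hypothetical attribute bits for the sketch. */
enum {
    MODEL_ATTR_ATIME = 1u << 0,
    MODEL_ATTR_MTIME = 1u << 1,
    MODEL_ATTR_CTIME = 1u << 2,
    MODEL_ATTR_SIZE  = 1u << 3,
    MODEL_ATTR_BASIC = MODEL_ATTR_ATIME | MODEL_ATTR_MTIME |
                       MODEL_ATTR_CTIME | MODEL_ATTR_SIZE
};

struct attr_cache {
    unsigned int invalid_mask; /* cached fields that can no longer be trusted */
};

/* A GETATTR is only needed if a requested field is invalid. */
static bool needs_getattr(const struct attr_cache *c, unsigned int request_mask)
{
    return (c->invalid_mask & request_mask) != 0;
}

/* Readahead completion invalidates atime only. */
static void readahead_done(struct attr_cache *c)
{
    c->invalid_mask |= MODEL_ATTR_ATIME;
}

/* The fix: writeback completion invalidates the attributes as well. */
static void writeback_done(struct attr_cache *c)
{
    c->invalid_mask |= MODEL_ATTR_BASIC;
}
```

After `readahead_done()` alone, an lstat()-style request for all basic fields triggers a refresh, but a statx()-style request for mtime only does not. Once `writeback_done()` has run, the mtime-only request triggers a refresh too.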
    • fuse: add a flag FUSE_SETXATTR_ACL_KILL_SGID to kill SGID · 550a7d3b
      Committed by Vivek Goyal
      When a posix access ACL is set, it can have an effect on the file mode, and
      SGID may also need to be cleared if:
      
      - None of the caller's group/supplementary groups match the file owner group.
      AND
      - The caller is not privileged (no CAP_FSETID).
      
      As of now the fuse server is responsible for changing the file mode as
      well. But it does not know whether to clear SGID or not.
      
      So add a flag FUSE_SETXATTR_ACL_KILL_SGID and send this info with SETXATTR
      to let file server know that sgid needs to be cleared as well.
      Reported-by: Luis Henriques <lhenriques@suse.de>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
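The two conditions in this commit combine into a single flag decision on the client side. The sketch below is a hypothetical model: `caller_model` and `setxattr_acl_flags` are illustrative names, the two booleans stand in for the group-membership and capability checks, and the flag value is a local stand-in rather than the real protocol constant.

```c
#include <stdbool.h>

/* Hypothetical stand-in for FUSE_SETXATTR_ACL_KILL_SGID. */
#define MODEL_SETXATTR_ACL_KILL_SGID (1u << 0)

/* What the client knows about the caller in this model. */
struct caller_model {
    bool in_owner_group; /* a group or supplementary group matches the file's group */
    bool has_cap_fsetid; /* caller is privileged (CAP_FSETID) */
};

/* Compute the extra SETXATTR flags to send along with the ACL. */
static unsigned int setxattr_acl_flags(const struct caller_model *c)
{
    unsigned int flags = 0;

    /* Both conditions must hold for SGID to be cleared. */
    if (!c->in_owner_group && !c->has_cap_fsetid)
        flags |= MODEL_SETXATTR_ACL_KILL_SGID;
    return flags;
}
```

The server can then clear SGID only when the flag is present, instead of guessing from information it does not have.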