提交 · 75b96f0ec5faf730128c32187e3e28441c27a094 · openeuler / Kernel

06 9月, 2021 1 次提交

fuse: wait for writepages in syncfs · 660585b5

由 Miklos Szeredi 提交于 9月 01, 2021

In case of fuse the MM subsystem doesn't guarantee that page writeback
completes by the time ->sync_fs() is called.  This is because fuse
completes page writeback immediately to prevent DoS of memory reclaim by
the userspace file server.

This means that fuse itself must ensure that writes are synced before
sending the SYNCFS request to the server.

Introduce sync buckets, that hold a counter for the number of outstanding
write requests.  On syncfs replace the current bucket with a new one and
wait until the old bucket's counter goes down to zero.

It is possible to have multiple syncfs calls in parallel, in which case
there could be more than one waited-on buckets.  Descendant buckets must
not complete until the parent completes.  Add a count to the child (new)
bucket until the (parent) old bucket completes.

Use RCU protection to dereference the current bucket and to wake up an
emptied bucket.  Use fc->lock to protect against parallel assignments to
the current bucket.

This leaves just the counter to be a possible scalability issue.  The
fc->num_waiting counter has a similar issue, so both should be addressed at
the same time.
Reported-by: NAmir Goldstein <amir73il@gmail.com>
Fixes: 2d82ab25 ("virtiofs: propagate sync() to file server")
Cc: <stable@vger.kernel.org> # v5.14
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

660585b5

05 8月, 2021 2 次提交

fuse: allow sharing existing sb · 5d5b74aa

由 Miklos Szeredi 提交于 8月 05, 2021

Make it possible to create a new mount from a already working server.

Here's a detailed description of the problem from Jakob:

  "The background for this question is occasional problems we see with our
   fuse filesystem [1] and mount namespaces. On a usual client, we have
   system-wide, autofs managed mountpoints. When a new mount namespace is
   created (which can be done unprivileged in combination with user
   namespaces), it can happen that a mountpoint is used inside the new
   namespace but idle in the root mount namespace. So autofs unmounts the
   parent, system-wide mountpoint. But the fuse module stays active and
   still serves mountpoint in the child mount namespace. Because the fuse
   daemon also blocks other system wide resources corresponding to the
   mountpoint, this situation effectively prevents new mounts until the
   child mount namespaces closes.

   [1] https://github.com/cvmfs/cvmfs"
Reported-by: NJakob Blomer <jblomer@cern.ch>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

5d5b74aa

fuse: move fget() to fuse_get_tree() · 62dd1fc8

由 Miklos Szeredi 提交于 8月 05, 2021

Affected call chains:

fuse_get_tree
   -> get_tree_(bdev|nodev)
      -> fuse_fill_super

Needed for following patch.
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

62dd1fc8

04 8月, 2021 2 次提交

fuse: move option checking into fuse_fill_super() · badc7414

由 Miklos Szeredi 提交于 8月 04, 2021

Checking whether the "fd=", "rootmode=", "user_id=" and "group_id=" mount
options are present can be moved from fuse_get_tree() into
fuse_fill_super() where the value of the options are consumed.

This relaxes semantics of reusing a fuse blockdev mount using the device
name.  Before this patch presence of these options were enforced but values
ignored, after this patch these options are completely ignored in this
case.
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

badc7414

fuse: name fs_context consistently · 84c21507

由 Miklos Szeredi 提交于 8月 04, 2021

Naming convention under fs/fuse/:

	struct fuse_conn *fc;
	struct fs_context *fsc;
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

84c21507

13 7月, 2021 1 次提交

fuse: Convert to using invalidate_lock · 8bcbbe9c

由 Jan Kara 提交于 4月 21, 2021

Use invalidate_lock instead of fuse's private i_mmap_sem. The intended
purpose is exactly the same. By this conversion we fix a long standing
race between hole punching and read(2) / readahead(2) paths that can
lead to stale page cache contents.

CC: Miklos Szeredi <miklos@szeredi.hu>
Reviewed-by: NMiklos Szeredi <mszeredi@redhat.com>
Signed-off-by: NJan Kara <jack@suse.cz>

8bcbbe9c

22 6月, 2021 5 次提交

fuse: fix illegal access to inode with reused nodeid · 15db1683

由 Amir Goldstein 提交于 6月 21, 2021

Server responds to LOOKUP and other ops (READDIRPLUS/CREATE/MKNOD/...)
with ourarg containing nodeid and generation.

If a fuse inode is found in inode cache with the same nodeid but different
generation, the existing fuse inode should be unhashed and marked "bad" and
a new inode with the new generation should be hashed instead.

This can happen, for example, with passhrough fuse filesystem that returns
the real filesystem ino/generation on lookup and where real inode numbers
can get recycled due to real files being unlinked not via the fuse
passthrough filesystem.

With current code, this situation will not be detected and an old fuse
dentry that used to point to an older generation real inode, can be used to
access a completely new inode, which should be accessed only via the new
dentry.

Note that because the FORGET message carries the nodeid w/o generation, the
server should wait to get FORGET counts for the nlookup counts of the old
and reused inodes combined, before it can free the resources associated to
that nodeid.
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

15db1683

fuse: Make fuse_fill_super_submount() static · 1b539917

由 Greg Kurz 提交于 6月 04, 2021

This function used to be called from fuse_dentry_automount(). This code
was moved to fuse_get_tree_submount() in the same file since then. It
is unlikely there will ever be another user. No need to be extern in
this case.
Signed-off-by: NGreg Kurz <groug@kaod.org>
Reviewed-by: NMax Reitz <mreitz@redhat.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

1b539917

fuse: Call vfs_get_tree() for submounts · 266eb3f2

由 Greg Kurz 提交于 6月 04, 2021

We recently fixed an infinite loop by setting the SB_BORN flag on
submounts along with the write barrier needed by super_cache_count().
This is the job of vfs_get_tree() and FUSE shouldn't have to care
about the barrier at all.

Split out some code from fuse_dentry_automount() to the dedicated
fuse_get_tree_submount() handler for submounts and call vfs_get_tree().
Signed-off-by: NGreg Kurz <groug@kaod.org>
Reviewed-by: NMax Reitz <mreitz@redhat.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

266eb3f2

fuse: add dedicated filesystem context ops for submounts · fe0a7bd8

由 Greg Kurz 提交于 6月 04, 2021

The creation of a submount is open-coded in fuse_dentry_automount().
This brings a lot of complexity and we recently had to fix bugs
because we weren't setting SB_BORN or because we were unlocking
sb->s_umount before sb was fully configured. Most of these could
have been avoided by using the mount API instead of open-coding.

Basically, this means coming up with a proper ->get_tree()
implementation for submounts and call vfs_get_tree(), or better
fc_mount().

The creation of the superblock for submounts is quite different from
the root mount. Especially, it doesn't require to allocate a FUSE
filesystem context, nor to parse parameters.

Introduce a dedicated context ops for submounts to make this clear.
This is just a placeholder for now, fuse_get_tree_submount() will
be populated in a subsequent patch.

Only visible change is that we stop allocating/freeing a useless FUSE
filesystem context with submounts.
Signed-off-by: NGreg Kurz <groug@kaod.org>
Reviewed-by: NMax Reitz <mreitz@redhat.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

fe0a7bd8

virtiofs: propagate sync() to file server · 2d82ab25

由 Greg Kurz 提交于 5月 20, 2021

Even if POSIX doesn't mandate it, linux users legitimately expect sync() to
flush all data and metadata to physical storage when it is located on the
same system.  This isn't happening with virtiofs though: sync() inside the
guest returns right away even though data still needs to be flushed from
the host page cache.

This is easily demonstrated by doing the following in the guest:

$ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s
sync()                                  = 0 <0.024068>

and start the following in the host when the 'dd' command completes
in the guest:

$ strace -T -e fsync /usr/bin/sync virtiofs/foo
fsync(3)                                = 0 <10.371640>

There are no good reasons not to honor the expected behavior of sync()
actually: it gives an unrealistic impression that virtiofs is super fast
and that data has safely landed on HW, which isn't the case obviously.

Implement a ->sync_fs() superblock operation that sends a new FUSE_SYNCFS
request type for this purpose.  Provision a 64-bit placeholder for possible
future extensions.  Since the file server cannot handle the wait == 0 case,
we skip it to avoid a gratuitous roundtrip.  Note that this is
per-superblock: a FUSE_SYNCFS is send for the root mount and for each
submount.

Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for FUSE_SYNCFS in
the file server is treated as permanent success.  This ensures
compatibility with older file servers: the client will get the current
behavior of sync() not being propagated to the file server.

Note that such an operation allows the file server to DoS sync().  Since a
typical FUSE file server is an untrusted piece of software running in
userspace, this is disabled by default.  Only enable it with virtiofs for
now since virtiofsd is supposedly trusted by the guest kernel.
Reported-by: NRobert Krawitz <rlk@redhat.com>
Signed-off-by: NGreg Kurz <groug@kaod.org>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

2d82ab25

16 4月, 2021 1 次提交
- A
  useful constants: struct qstr for ".." · 80e5d1ff
  由 Al Viro 提交于 4月 15, 2021
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  80e5d1ff
14 4月, 2021 2 次提交

virtiofs: split requests that exceed virtqueue size · a7f0d7aa

由 Connor Kuehl 提交于 3月 18, 2021

If an incoming FUSE request can't fit on the virtqueue, the request is
placed onto a workqueue so a worker can try to resubmit it later where
there will (hopefully) be space for it next time.

This is fine for requests that aren't larger than a virtqueue's maximum
capacity.  However, if a request's size exceeds the maximum capacity of the
virtqueue (even if the virtqueue is empty), it will be doomed to a life of
being placed on the workqueue, removed, discovered it won't fit, and placed
on the workqueue yet again.

Furthermore, from section 2.6.5.3.1 (Driver Requirements: Indirect
Descriptors) of the virtio spec:

  "A driver MUST NOT create a descriptor chain longer than the Queue
  Size of the device."

To fix this, limit the number of pages FUSE will use for an overall
request.  This way, each request can realistically fit on the virtqueue
when it is decomposed into a scattergather list and avoid violating section
2.6.5.3.1 of the virtio spec.
Signed-off-by: NConnor Kuehl <ckuehl@redhat.com>
Reviewed-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

a7f0d7aa

fuse: extend FUSE_SETXATTR request · 52a4c95f

由 Vivek Goyal 提交于 3月 25, 2021

Fuse client needs to send additional information to file server when it
calls SETXATTR(system.posix_acl_access), so add extra flags field to the
structure.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

52a4c95f

08 3月, 2021 1 次提交

new helper: inode_wrong_type() · 6e3e2c43

由 Al Viro 提交于 3月 01, 2021

inode_wrong_type(inode, mode) returns true if setting inode->i_mode
to given value would've changed the inode type.  We have enough of
those checks open-coded to make a helper worthwhile.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

6e3e2c43

10 12月, 2020 1 次提交

fuse: fix bad inode · 5d069dbe

由 Miklos Szeredi 提交于 12月 10, 2020

Jan Kara's analysis of the syzbot report (edited):

  The reproducer opens a directory on FUSE filesystem, it then attaches
  dnotify mark to the open directory.  After that a fuse_do_getattr() call
  finds that attributes returned by the server are inconsistent, and calls
  make_bad_inode() which, among other things does:

          inode->i_mode = S_IFREG;

  This then confuses dnotify which doesn't tear down its structures
  properly and eventually crashes.

Avoid calling make_bad_inode() on a live inode: switch to a private flag on
the fuse inode.  Also add the test to ops which the bad_inode_ops would
have caught.

This bug goes back to the initial merge of fuse in 2.6.14...

Reported-by: syzbot+f427adf9324b92652ccc@syzkaller.appspotmail.com
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Tested-by: NJan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>

5d069dbe

12 11月, 2020 5 次提交

fuse: support SB_NOSEC flag to improve write performance · 9d769e6a

由 Vivek Goyal 提交于 10月 09, 2020

Virtiofs can be slow with small writes if xattr are enabled and we are
doing cached writes (No direct I/O). Ganesh Mahalingam noticed this.

Some debugging showed that file_remove_privs() is called in cached write
path on every write. And everytime it calls security_inode_need_killpriv()
which results in call to __vfs_getxattr(XATTR_NAME_CAPS). And this goes to
file server to fetch xattr. This extra round trip for every write slows
down writes tremendously.

Normally to avoid paying this penalty on every write, vfs has the notion of
caching this information in inode (S_NOSEC). So vfs sets S_NOSEC, if
filesystem opted for it using super block flag SB_NOSEC. And S_NOSEC is
cleared when setuid/setgid bit is set or when security xattr is set on
inode so that next time a write happens, we check inode again for clearing
setuid/setgid bits as well clear any security.capability xattr.

This seems to work well for local file systems but for remote file systems
it is possible that VFS does not have full picture and a different client
sets setuid/setgid bit or security.capability xattr on file and that means
VFS information about S_NOSEC on another client will be stale. So for
remote filesystems SB_NOSEC was disabled by default.

Commit 9e1f1de0 ("more conservative S_NOSEC handling") mentioned that
these filesystems can still make use of SB_NOSEC as long as they clear
S_NOSEC when they are refreshing inode attriutes from server.

So this patch tries to enable SB_NOSEC on fuse (regular fuse as well as
virtiofs). And clear SB_NOSEC when we are refreshing inode attributes.

This is enabled only if server supports FUSE_HANDLE_KILLPRIV_V2. This says
that server will clear setuid/setgid/security.capability on
chown/truncate/write as apporpriate.

This should provide tighter coherency because now suid/sgid/
security.capability will be cleared even if fuse client cache has not seen
these attrs.

Basic idea is that fuse client will trigger suid/sgid/security.capability
clearing based on its attr cache. But even if cache has gone stale, it is
fine because FUSE_HANDLE_KILLPRIV_V2 will make sure WRITE clear
suid/sgid/security.capability.

We make this change only if server supports FUSE_HANDLE_KILLPRIV_V2. This
should make sure that existing filesystems which might be relying on
seucurity.capability always being queried from server are not impacted.

This tighter coherency relies on WRITE showing up on server (and not being
cached in guest). So writeback_cache mode will not provide that tight
coherency and it is not recommended to use two together. Having said that
it might work reasonably well for lot of use cases.

This change improves random write performance very significantly. Running
virtiofsd with cache=auto and following fio command:

fio --ioengine=libaio --direct=1 --name=test --filename=/mnt/virtiofs/random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randwrite

Bandwidth increases from around 50MB/s to around 250MB/s as a result of
applying this patch. So improvement is very significant.

Link: https://github.com/kata-containers/runtime/issues/2815Reported-by: N"Mahalingam, Ganesh" <ganesh.mahalingam@intel.com>
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

9d769e6a

fuse: introduce the notion of FUSE_HANDLE_KILLPRIV_V2 · 63f9909f

由 Vivek Goyal 提交于 10月 09, 2020

We already have FUSE_HANDLE_KILLPRIV flag that says that file server will
remove suid/sgid/caps on truncate/chown/write. But that's little different
from what Linux VFS implements.

To be consistent with Linux VFS behavior what we want is.

- caps are always cleared on chown/write/truncate
- suid is always cleared on chown, while for truncate/write it is cleared
  only if caller does not have CAP_FSETID.
- sgid is always cleared on chown, while for truncate/write it is cleared
  only if caller does not have CAP_FSETID as well as file has group execute
  permission.

As previous flag did not provide above semantics. Implement a V2 of the
protocol with above said constraints.

Server does not know if caller has CAP_FSETID or not. So for the case
of write()/truncate(), client will send information in special flag to
indicate whether to kill priviliges or not. These changes are in subsequent
patches.

FUSE_HANDLE_KILLPRIV_V2 relies on WRITE being sent to server to clear
suid/sgid/security.capability. But with ->writeback_cache, WRITES are
cached in guest. So it is not recommended to use FUSE_HANDLE_KILLPRIV_V2
and writeback_cache together. Though it probably might be good enough
for lot of use cases.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

63f9909f

fuse: add fuse_sb_destroy() helper · 6a68d1e1

由 Miklos Szeredi 提交于 11月 11, 2020

This is to avoid minor code duplication between fuse_kill_sb_anon() and
fuse_kill_sb_blk().
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

6a68d1e1

fuse: get rid of fuse_mount refcount · 514b5e3f

由 Miklos Szeredi 提交于 11月 11, 2020

Fuse mount now only ever has a refcount of one (before being freed) so the
count field is unnecessary.

Remove the refcounting and fold fuse_mount_put() into callers.  The only
caller of fuse_mount_put() where fm->fc was NULL is fuse_dentry_automount()
and here the fuse_conn_put() can simply be omitted.
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

514b5e3f

virtiofs: simplify sb setup · b19d3d00

由 Miklos Szeredi 提交于 11月 11, 2020

Currently when acquiring an sb for virtiofs fuse_mount_get() is being
called from virtio_fs_set_super() if a new sb is being filled and
fuse_mount_put() is called unconditionally after sget_fc() returns.

The exact same result can be obtained by checking whether
fs_contex->s_fs_info was set to NULL (ref trasferred to sb->s_fs_info) and
only calling fuse_mount_put() if the ref wasn't transferred (error or
matching sb found).

This allows getting rid of virtio_fs_set_super() and fuse_mount_get().
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

b19d3d00

12 10月, 2020 1 次提交

fuse: connection remove fix · 413daa1a

由 Miklos Szeredi 提交于 10月 09, 2020

Re-add lost removal of fc from fuse_conn_list and the control filesystem.
Reported-by: Nkernel test robot <rong.a.chen@intel.com>
Fixes: fcee216b ("fuse: split fuse_mount off of fuse_conn")
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

413daa1a

09 10月, 2020 1 次提交

fuse: implement crossmounts · bf109c64

由 Max Reitz 提交于 4月 21, 2020

FUSE servers can indicate crossmount points by setting FUSE_ATTR_SUBMOUNT
in fuse_attr.flags.  The inode will then be marked as S_AUTOMOUNT, and the
.d_automount implementation creates a new submount at that location, so
that the submount gets a distinct st_dev value.

Note that all submounts get a distinct superblock and a distinct st_dev
value, so for virtio-fs, even if the same filesystem is mounted more than
once on the host, none of its mount points will have the same st_dev.  We
need distinct superblocks because the superblock points to the root node,
but the different host mounts may show different trees (e.g. due to
submounts in some of them, but not in others).

Right now, this behavior is only enabled when fuse_conn.auto_submounts is
set, which is the case only for virtio-fs.
Signed-off-by: NMax Reitz <mreitz@redhat.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

bf109c64

25 9月, 2020 2 次提交

bdi: invert BDI_CAP_NO_ACCT_WB · 823423ef

由 Christoph Hellwig 提交于 9月 24, 2020

Replace BDI_CAP_NO_ACCT_WB with a positive BDI_CAP_WRITEBACK_ACCT to
make the checks more obvious.  Also remove the pointless
bdi_cap_account_writeback wrapper that just obsfucates the check.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

823423ef

bdi: initialize ->ra_pages and ->io_pages in bdi_init · 55b2598e

由 Christoph Hellwig 提交于 9月 24, 2020

Set up a readahead size by default, as very few users have a good
reason to change it.  This means code, ecryptfs, and orangefs now
set up the values while they were previously missing it, while ubifs,
mtd and vboxsf manually set it to 0 to avoid readahead.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd]
Signed-off-by: NJens Axboe <axboe@kernel.dk>

55b2598e

18 9月, 2020 2 次提交

fuse: Allow fuse_fill_super_common() for submounts · 1866d779

由 Max Reitz 提交于 9月 09, 2020

Submounts have their own superblock, which needs to be initialized.
However, they do not have a fuse_fs_context associated with them, and
the root node's attributes should be taken from the mountpoint's node.

Extend fuse_fill_super_common() to work for submounts by making the @ctx
parameter optional, and by adding a @submount_finode parameter.

(There is a plain "unsigned" in an existing code block that is being
indented by this commit.  Extend it to "unsigned int" so checkpatch does
not complain.)
Signed-off-by: NMax Reitz <mreitz@redhat.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

1866d779

fuse: split fuse_mount off of fuse_conn · fcee216b

由 Max Reitz 提交于 5月 06, 2020

We want to allow submounts for the same fuse_conn, but with different
superblocks so that each of the submounts has its own device ID.  To do
so, we need to split all mount-specific information off of fuse_conn
into a new fuse_mount structure, so that multiple mounts can share a
single fuse_conn.

We need to take care only to perform connection-level actions once (i.e.
when the fuse_conn and thus the first fuse_mount are established, or
when the last fuse_mount and thus the fuse_conn are destroyed).  For
example, fuse_sb_destroy() must invoke fuse_send_destroy() until the
last superblock is released.

To do so, we keep track of which fuse_mount is the root mount and
perform all fuse_conn-level actions only when this fuse_mount is
involved.
Signed-off-by: NMax Reitz <mreitz@redhat.com>
Reviewed-by: NStefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

fcee216b

10 9月, 2020 5 次提交

virtiofs: serialize truncate/punch_hole and dax fault path · 6ae330ca

由 Vivek Goyal 提交于 8月 19, 2020

Currently in fuse we don't seem have any lock which can serialize fault
path with truncate/punch_hole path. With dax support I need one for
following reasons.

1. Dax requirement

  DAX fault code relies on inode size being stable for the duration of
  fault and want to serialize with truncate/punch_hole and they explicitly
  mention it.

  static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
                               const struct iomap_ops *ops)
        /*
         * Check whether offset isn't beyond end of file now. Caller is
         * supposed to hold locks serializing us with truncate / punch hole so
         * this is a reliable test.
         */
        max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);

2. Make sure there are no users of pages being truncated/punch_hole

  get_user_pages() might take references to page and then do some DMA
  to said pages. Filesystem might truncate those pages without knowing
  that a DMA is in progress or some I/O is in progress. So use
  dax_layout_busy_page() to make sure there are no such references
  and I/O is not in progress on said pages before moving ahead with
  truncation.

3. Limitation of kvm page fault error reporting

  If we are truncating file on host first and then removing mappings in
  guest lateter (truncate page cache etc), then this could lead to a
  problem with KVM. Say a mapping is in place in guest and truncation
  happens on host. Now if guest accesses that mapping, then host will
  take a fault and kvm will either exit to qemu or spin infinitely.

  IOW, before we do truncation on host, we need to make sure that guest
  inode does not have any mapping in that region or whole file.

4. virtiofs memory range reclaim

 Soon I will introduce the notion of being able to reclaim dax memory
 ranges from a fuse dax inode. There also I need to make sure that
 no I/O or fault is going on in the reclaimed range and nobody is using
 it so that range can be reclaimed without issues.

Currently if we take inode lock, that serializes read/write. But it does
not do anything for faults. So I add another semaphore fuse_inode->i_mmap_sem
for this purpose.  It can be used to serialize with faults.

As of now, I am adding taking this semaphore only in dax fault path and
not regular fault path because existing code does not have one. May
be existing code can benefit from it as well to take care of some
races, but that we can fix later if need be. For now, I am just focussing
only on DAX path which is new path.

Also added logic to take fuse_inode->i_mmap_sem in
truncate/punch_hole/open(O_TRUNC) path to make sure file truncation and
fuse dax fault are mutually exlusive and avoid all the above problems.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

6ae330ca

virtiofs: implement dax read/write operations · c2d0ad00

由 Vivek Goyal 提交于 8月 19, 2020

This patch implements basic DAX support. mmap() is not implemented
yet and will come in later patches. This patch looks into implemeting
read/write.

We make use of interval tree to keep track of per inode dax mappings.

Do not use dax for file extending writes, instead just send WRITE message
to daemon (like we do for direct I/O path). This will keep write and
i_size change atomic w.r.t crash.
Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: NDr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: NPeng Tao <tao.peng@linux.alibaba.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

c2d0ad00

virtiofs: implement FUSE_INIT map_alignment field · fd1a1dc6

由 Stefan Hajnoczi 提交于 8月 19, 2020

The device communicates FUSE_SETUPMAPPING/FUSE_REMOVMAPPING alignment
constraints via the FUST_INIT map_alignment field.  Parse this field and
ensure our DAX mappings meet the alignment constraints.

We don't actually align anything differently since our mappings are
already 2MB aligned.  Just check the value when the connection is
established.  If it becomes necessary to honor arbitrary alignments in
the future we'll have to adjust how mappings are sized.

The upshot of this commit is that we can be confident that mappings will
work even when emulating x86 on Power and similar combinations where the
host page sizes are different.
Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

fd1a1dc6

virtiofs: add a mount option to enable dax · 1dd53957

由 Vivek Goyal 提交于 8月 19, 2020

Add a mount option to allow using dax with virtio_fs.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

1dd53957

virtiofs: get rid of no_mount_options · f4fd4ae3

由 Vivek Goyal 提交于 8月 19, 2020

This option was introduced so that for virtio_fs we don't show any mounts
options fuse_show_options(). Because we don't offer any of these options
to be controlled by mounter.

Very soon we are planning to introduce option "dax" which mounter should
be able to specify. And no_mount_options does not work anymore.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

f4fd4ae3

14 7月, 2020 3 次提交

fuse: reject options on reconfigure via fsconfig(2) · b330966f

由 Miklos Szeredi 提交于 7月 14, 2020

Previous patch changed handling of remount/reconfigure to ignore all
options, including those that are unknown to the fuse kernel fs.  This was
done for backward compatibility, but this likely only affects the old
mount(2) API.

The new fsconfig(2) based reconfiguration could possibly be improved.  This
would make the new API less of a drop in replacement for the old, OTOH this
is a good chance to get rid of some weirdnesses in the old API.

Several other behaviors might make sense:

 1) unknown options are rejected, known options are ignored

 2) unknown options are rejected, known options are rejected if the value
 is changed, allowed otherwise

 3) all options are rejected

Prior to the backward compatibility fix to ignore all options all known
options were accepted (1), even if they change the value of a mount
parameter; fuse_reconfigure() does not look at the config values set by
fuse_parse_param().

To fix that we'd need to verify that the value provided is the same as set
in the initial configuration (2).  The major drawback is that this is much
more complex than just rejecting all attempts at changing options (3);
i.e. all options signify initial configuration values and don't make sense
on reconfigure.

This patch opts for (3) with the rationale that no mount options are
reconfigurable in fuse.
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

b330966f

fuse: ignore 'data' argument of mount(..., MS_REMOUNT) · e8b20a47

由 Miklos Szeredi 提交于 7月 14, 2020

The command

  mount -o remount -o unknownoption /mnt/fuse

succeeds on kernel versions prior to v5.4 and fails on kernel version at or
after.  This is because fuse_parse_param() rejects any unrecognised options
in case of FS_CONTEXT_FOR_RECONFIGURE, just as for FS_CONTEXT_FOR_MOUNT.

This causes a regression in case the fuse filesystem is in fstab, since
remount sends all options found there to the kernel; even ones that are
meant for the initial mount and are consumed by the userspace fuse server.

Fix this by ignoring mount options, just as fuse_remount_fs() did prior to
the conversion to the new API.
Reported-by: NStefan Priebe <s.priebe@profihost.ag>
Fixes: c30da2e9 ("fuse: convert to use the new mount API")
Cc: <stable@vger.kernel.org> # v5.4
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

e8b20a47

fuse: use ->reconfigure() instead of ->remount_fs() · 0189a2d3

由 Miklos Szeredi 提交于 7月 14, 2020

s_op->remount_fs() is only called from legacy_reconfigure(), which is not
used after being converted to the new API.

Convert to using ->reconfigure().  This restores the previous behavior of
syncing the filesystem and rejecting MS_MANDLOCK on remount.

Fixes: c30da2e9 ("fuse: convert to use the new mount API")
Cc: <stable@vger.kernel.org> # v5.4
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

0189a2d3

19 5月, 2020 2 次提交

fuse: update attr_version counter on fuse_notify_inval_inode() · 5ddd9ced

由 Miklos Szeredi 提交于 5月 19, 2020

A GETATTR request can race with FUSE_NOTIFY_INVAL_INODE, resulting in the
attribute cache being updated with stale information after the
invalidation.

Fix this by bumping the attribute version in fuse_reverse_inval_inode().
Reported-by: NKrzysztof Rusek <rusek@9livesdata.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

5ddd9ced

virtiofs: do not use fuse_fill_super_common() for device installation · 7fd3abfa

由 Vivek Goyal 提交于 5月 04, 2020

fuse_fill_super_common() allocates and installs one fuse_device.  Hence
virtiofs allocates and install all fuse devices by itself except one.

This makes logic little twisted.  There does not seem to be any real need
that why virtiofs can't allocate and install all fuse devices itself.

So opt out of fuse device allocation and installation while calling
fuse_fill_super_common().

Regular fuse still wants fuse_fill_super_common() to install fuse_device.
It needs to prevent against races where two mounters are trying to mount
fuse using same fd.  In that case one will succeed while other will get
-EINVAL.

virtiofs does not have this issue because sget_fc() resolves the race
w.r.t multiple mounters and only one instance of virtio_fs_fill_super()
should be in progress for same filesystem.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

7fd3abfa

08 2月, 2020 3 次提交

A
fuse: switch to use errorfc() et.al. · 2e28c49e
由 Al Viro 提交于 12月 21, 2019
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
2e28c49e

fs_parse: fold fs_parameter_desc/fs_parameter_spec · d7167b14

由 Al Viro 提交于 9月 07, 2019

The former contains nothing but a pointer to an array of the latter...
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

d7167b14

fs_parser: remove fs_parameter_description name field · 96cafb9c

由 Eric Sandeen 提交于 12月 06, 2019

Unused now.
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Acked-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

96cafb9c

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功