1. 10 Sep, 2020 5 commits
    • virtiofs: serialize truncate/punch_hole and dax fault path · 6ae330ca
      Vivek Goyal committed
      Currently in fuse we don't seem to have any lock which can serialize the
      fault path with the truncate/punch_hole path. With dax support I need one
      for the following reasons.
      
      1. Dax requirement
      
        The DAX fault code relies on the inode size being stable for the
        duration of the fault and wants to be serialized with
        truncate/punch_hole. The code says so explicitly:
      
        static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
                                     const struct iomap_ops *ops)
              /*
               * Check whether offset isn't beyond end of file now. Caller is
               * supposed to hold locks serializing us with truncate / punch hole so
               * this is a reliable test.
               */
              max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
      
      2. Make sure there are no users of pages being truncated/punch_hole
      
        get_user_pages() might take references to page and then do some DMA
        to said pages. Filesystem might truncate those pages without knowing
        that a DMA is in progress or some I/O is in progress. So use
        dax_layout_busy_page() to make sure there are no such references
        and I/O is not in progress on said pages before moving ahead with
        truncation.
      
      3. Limitation of kvm page fault error reporting
      
        If we are truncating file on host first and then removing mappings in
        guest later (truncate page cache etc), then this could lead to a
        problem with KVM. Say a mapping is in place in guest and truncation
        happens on host. Now if guest accesses that mapping, then host will
        take a fault and kvm will either exit to qemu or spin infinitely.
      
        IOW, before we do truncation on host, we need to make sure that guest
        inode does not have any mapping in that region or whole file.
      
      4. virtiofs memory range reclaim
      
       Soon I will introduce the notion of being able to reclaim dax memory
       ranges from a fuse dax inode. There also I need to make sure that
       no I/O or fault is going on in the reclaimed range and nobody is using
       it so that range can be reclaimed without issues.
      
      Currently if we take inode lock, that serializes read/write. But it does
      not do anything for faults. So I add another semaphore fuse_inode->i_mmap_sem
      for this purpose.  It can be used to serialize with faults.
      
      As of now, I am taking this semaphore only in the dax fault path and
      not in the regular fault path, because existing code does not have one.
      Maybe existing code can benefit from it as well to take care of some
      races, but that we can fix later if need be. For now, I am focusing
      only on the DAX path, which is a new path.
      
      Also added logic to take fuse_inode->i_mmap_sem in the
      truncate/punch_hole/open(O_TRUNC) path to make sure file truncation and
      fuse dax fault are mutually exclusive and avoid all the above problems.
      
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • virtiofs: implement dax read/write operations · c2d0ad00
      Vivek Goyal committed
      This patch implements basic DAX support. mmap() is not implemented
      yet and will come in later patches. This patch implements only
      read/write.
      
      We use an interval tree to keep track of per-inode dax mappings.
      
      Do not use dax for file-extending writes; instead just send a WRITE
      message to the daemon (like we do for the direct I/O path). This keeps
      the write and the i_size change atomic with respect to a crash.
      
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
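The rule for file-extending writes in this commit can be illustrated with a short sketch. It assumes that "file extending" means a non-empty write ending beyond the current file size. The enum and function names are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the two write paths named in the commit. */
enum demo_write_path {
    DEMO_WRITE_DAX,     /* copy directly into the mapped DAX range */
    DEMO_WRITE_MESSAGE  /* send a WRITE message to the daemon */
};

/* Assumption: a write extends the file when it is non-empty and ends
 * beyond the current size. Offset overflow is ignored in this sketch. */
static bool demo_write_extends_file(uint64_t i_size, uint64_t pos, uint64_t len)
{
    return len != 0 && pos + len > i_size;
}

/* Pick the path: file-extending writes bypass DAX so that the data and
 * the size update reach the daemon in a single request. */
static enum demo_write_path demo_pick_write_path(uint64_t i_size, uint64_t pos,
                                                 uint64_t len)
{
    return demo_write_extends_file(i_size, pos, len) ? DEMO_WRITE_MESSAGE
                                                     : DEMO_WRITE_DAX;
}
```

For a 4096-byte file, a write of 4096 bytes at offset 0 stays on the DAX path, while a 200-byte write at offset 4000 ends past the file size and is routed to the WRITE message path.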
    • virtiofs: implement FUSE_INIT map_alignment field · fd1a1dc6
      Stefan Hajnoczi committed
      The device communicates FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING alignment
      constraints via the FUSE_INIT map_alignment field.  Parse this field and
      ensure our DAX mappings meet the alignment constraints.
      
      We don't actually align anything differently since our mappings are
      already 2MB aligned.  Just check the value when the connection is
      established.  If it becomes necessary to honor arbitrary alignments in
      the future we'll have to adjust how mappings are sized.
      
      The upshot of this commit is that we can be confident that mappings will
      work even when emulating x86 on Power and similar combinations where the
      host page sizes are different.
      
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
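The alignment check described in this commit might look roughly like the following. This is a sketch under one stated assumption: that map_alignment carries the log2 of the required byte alignment. The commit message does not spell out the encoding, and the names `DEMO_DAX_SHIFT` and `demo_check_map_alignment` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mappings are 2MB aligned per the commit message: 2MB == 1 << 21. */
#define DEMO_DAX_SHIFT 21u

/* Assumption: map_alignment is log2 of the byte alignment the device
 * requires. Since alignments are powers of two, a 2MB-aligned mapping
 * satisfies every requirement up to and including 2MB. */
static bool demo_check_map_alignment(uint32_t map_alignment)
{
    return map_alignment <= DEMO_DAX_SHIFT;
}
```

Under that assumption, a device asking for 4KB (12) or 64KB (16) alignment is accepted without changing how mappings are sized, while a request for 4MB (22) alignment would be rejected when the connection is established.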
    • virtiofs: add a mount option to enable dax · 1dd53957
      Vivek Goyal committed
      Add a mount option to allow using dax with virtio_fs.
      
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • virtiofs: get rid of no_mount_options · f4fd4ae3
      Vivek Goyal committed
      This option was introduced so that for virtio_fs we don't show any mount
      options in fuse_show_options(), because we don't offer any of these
      options to be controlled by the mounter.
      
      Very soon we are planning to introduce the option "dax", which the
      mounter should be able to specify, so no_mount_options no longer works.
      
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  2. 19 May, 2020 1 commit
    • fuse: optimize writepages search · 6b2fb799
      Maxim Patlasov committed
      Re-work fi->writepages, replacing list with rb-tree.  This improves
      performance because kernel fuse iterates through fi->writepages for each
      writeback page and typical number of entries is about 800 (for 100MB of
      fuse writeback).
      
      Before patch:
      
      10240+0 records in
      10240+0 records out
      10737418240 bytes (11 GB) copied, 41.3473 s, 260 MB/s
      
       2  1      0 57445400  40416 6323676    0    0    33 374743 8633 19210  1  8 88  3  0
      
        29.86%  [kernel]               [k] _raw_spin_lock
        26.62%  [fuse]                 [k] fuse_page_is_writeback
      
      After patch:
      
      10240+0 records in
      10240+0 records out
      10737418240 bytes (11 GB) copied, 21.4954 s, 500 MB/s
      
       2  9      0 53676040  31744 10265984    0    0    64 854790 10956 48387  1  6 88  6  0
      
        23.55%  [kernel]             [k] copy_user_enhanced_fast_string
         9.87%  [kernel]             [k] __memcpy
         3.10%  [kernel]             [k] _raw_spin_lock
      
      Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
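The gain from replacing the list with a tree can be sketched with a simplified structure. The example below uses a plain unbalanced binary search tree as a stand-in for the kernel rb-tree, keyed by page index ranges that are assumed never to overlap. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node: one in-flight writeback request covering pages
 * [first, last]. Ranges stored in the tree never overlap. */
struct demo_wb_node {
    uint64_t first, last;
    struct demo_wb_node *left, *right;
};

/* Insert a node. Returns NULL on success, or the existing node that
 * overlaps the new range (in which case nothing is inserted). */
static struct demo_wb_node *demo_wb_insert(struct demo_wb_node **root,
                                           struct demo_wb_node *n)
{
    struct demo_wb_node **p = root;

    n->left = n->right = NULL;
    while (*p) {
        if (n->last < (*p)->first)
            p = &(*p)->left;
        else if (n->first > (*p)->last)
            p = &(*p)->right;
        else
            return *p;
    }
    *p = n;
    return NULL;
}

/* Find a request overlapping pages [from, to]. Each step discards one
 * subtree, so the cost is the tree height rather than the entry count. */
static struct demo_wb_node *demo_wb_find(struct demo_wb_node *root,
                                         uint64_t from, uint64_t to)
{
    while (root) {
        if (to < root->first)
            root = root->left;
        else if (from > root->last)
            root = root->right;
        else
            return root;
    }
    return NULL;
}
```

With about 800 entries, a list walk touches up to 800 nodes per writeback page, whereas a balanced tree needs roughly ten comparisons. The unbalanced tree here only illustrates the lookup logic and does not guarantee that bound.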
  3. 20 Apr, 2020 1 commit
    • virtiofs: schedule blocking async replies in separate worker · bb737bbe
      Vivek Goyal committed
      In virtiofs (unlike in regular fuse) processing of async replies is
      serialized.  This can result in a deadlock in rare corner cases when
      there's a circular dependency between the completion of two or more async
      replies.
      
      Such a deadlock can be reproduced with xfstests:generic/503 if TEST_DIR ==
      SCRATCH_MNT (which is a misconfiguration):
      
       - Process A is waiting for page lock in worker thread context and blocked
         (virtio_fs_requests_done_work()).
       - Process B is holding page lock and waiting for pending writes to
         finish (fuse_wait_on_page_writeback()).
       - Write requests are waiting in virtqueue and can't complete because
         worker thread is blocked on page lock (process A).
      
      Fix this by creating a unique work_struct for each async reply that can
      block (O_DIRECT read).
      
      Fixes: a62a8ef9 ("virtio-fs: add virtiofs filesystem")
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  4. 13 Feb, 2020 1 commit
  5. 12 Nov, 2019 1 commit
  6. 15 Oct, 2019 1 commit
  7. 19 Sep, 2019 1 commit
    • virtio-fs: add virtiofs filesystem · a62a8ef9
      Stefan Hajnoczi committed
      Add a basic file system module for virtio-fs.  This does not yet contain
      shared data support between host and guest or metadata coherency speedups.
      However it is already significantly faster than virtio-9p.
      
      Design Overview
      ===============
      
      With the goal of designing something with better performance and local file
      system semantics, a bunch of ideas were proposed.
      
       - Use fuse protocol (instead of 9p) for communication between guest and
         host.  Guest kernel will be fuse client and a fuse server will run on
         host to serve the requests.
      
       - For data access inside guest, mmap portion of file in QEMU address space
         and guest accesses this memory using dax.  That way guest page cache is
         bypassed and there is only one copy of data (on host).  This will also
         enable mmap(MAP_SHARED) between guests.
      
       - For metadata coherency, there is a shared memory region which contains
         version number associated with metadata and any guest changing metadata
         updates version number and other guests refresh metadata on next access.
         This is yet to be implemented.
      
      How virtio-fs differs from existing approaches
      ==============================================
      
      The unique idea behind virtio-fs is to take advantage of the co-location of
      the virtual machine and hypervisor to avoid communication (vmexits).
      
      DAX allows file contents to be accessed without communication with the
      hypervisor.  The shared memory region for metadata avoids communication in
      the common case where metadata is unchanged.
      
      By replacing expensive communication with cheaper shared memory accesses,
      we expect to achieve better performance than approaches based on network
      file system protocols.  In addition, this also makes it easier to achieve
      local file system semantics (coherency).
      
      These techniques are not applicable to network file system protocols since
      the communications channel is bypassed by taking advantage of shared memory
      on a local machine.  This is why we decided to build virtio-fs rather than
      focus on 9P or NFS.
      
      Caching Modes
      =============
      
      Like virtio-9p, different caching modes are supported which determine the
      coherency level as well.  The “cache=FOO” and “writeback” options control
      the level of coherence between the guest and host filesystems.
      
       - cache=none
         metadata, data and pathname lookup are not cached in guest.  They are
         always fetched from host and any changes are immediately pushed to host.
      
       - cache=always
         metadata, data and pathname lookup are cached in guest and never expire.
      
       - cache=auto
         metadata and pathname lookup cache expires after a configured amount of
         time (default is 1 second).  Data is cached while the file is open
         (close to open consistency).
      
       - writeback/no_writeback
         These options control the writeback strategy.  If writeback is disabled,
         then normal writes will immediately be synchronized with the host fs.
         If writeback is enabled, then writes may be cached in the guest until
         the file is closed or an fsync(2) performed.  This option has no effect
         on mmap-ed writes or writes going through the DAX mechanism.
      
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  8. 12 Sep, 2019 11 commits
  9. 10 Sep, 2019 18 commits