1. 10 Dec, 2020 1 commit
    • fuse: fix bad inode · 5d069dbe
      Miklos Szeredi committed
      Jan Kara's analysis of the syzbot report (edited):
      
        The reproducer opens a directory on FUSE filesystem, it then attaches
        dnotify mark to the open directory.  After that a fuse_do_getattr() call
        finds that attributes returned by the server are inconsistent, and calls
        make_bad_inode() which, among other things does:
      
                inode->i_mode = S_IFREG;
      
        This then confuses dnotify which doesn't tear down its structures
        properly and eventually crashes.
      
      Avoid calling make_bad_inode() on a live inode: switch to a private flag on
      the fuse inode.  Also add the test to ops which the bad_inode_ops would
      have caught.
      
      This bug goes back to the initial merge of fuse in 2.6.14...
      
      Reported-by: syzbot+f427adf9324b92652ccc@syzkaller.appspotmail.com
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Tested-by: Jan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org>
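The idea in the commit above (record "badness" in a private flag instead of rewriting the inode's mode) can be sketched in userspace as follows. This is only an illustration: `demo_inode`, `DEMO_I_BAD`, `demo_make_bad()`, `demo_is_bad()` and `demo_getattr()` are hypothetical names, and returning -EIO from the guarded operation is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names come from the kernel. */
#define DEMO_S_IFMT  0170000u
#define DEMO_S_IFDIR 0040000u

enum { DEMO_I_BAD = 1u << 0 };   /* private "bad" state bit */

struct demo_inode {
    unsigned int i_mode;   /* file type and permission bits, never rewritten */
    unsigned int state;    /* private per-inode flags */
};

/* Mark the inode bad without touching i_mode, so the file type stays stable. */
static void demo_make_bad(struct demo_inode *inode)
{
    inode->state |= DEMO_I_BAD;
}

static bool demo_is_bad(const struct demo_inode *inode)
{
    return (inode->state & DEMO_I_BAD) != 0;
}

/* Each operation now has to test the flag itself (error code is assumed). */
static int demo_getattr(const struct demo_inode *inode)
{
    if (demo_is_bad(inode))
        return -EIO;
    return 0;
}
```

After `demo_make_bad()` the inode still reports itself as a directory, which is the property the dnotify teardown relies on in the analysis quoted above.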
  2. 12 Nov, 2020 5 commits
    • fuse: add a flag FUSE_OPEN_KILL_SUIDGID for open() request · 643a666a
      Vivek Goyal committed
      With FUSE_HANDLE_KILLPRIV_V2 support, server will need to kill suid/sgid/
      security.capability on open(O_TRUNC), if server supports
      FUSE_ATOMIC_O_TRUNC.
      
      But server needs to kill suid/sgid only if caller does not have CAP_FSETID.
      Given server does not have this information, client needs to send this info
      to server.
      
      So add a flag FUSE_OPEN_KILL_SUIDGID to fuse_open_in request which tells
      server to kill suid/sgid (only if group execute is set).
      
      This flag is added to the FUSE_OPEN request, as well as the FUSE_CREATE
      request if the create was non-exclusive, since that might result in an
      existing file being opened/truncated.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: don't send ATTR_MODE to kill suid/sgid for handle_killpriv_v2 · 8981bdfd
      Vivek Goyal committed
      If client does a write() on a suid/sgid file, VFS will first call
      fuse_setattr() with ATTR_KILL_S[UG]ID set.  This requires sending setattr
      to file server with ATTR_MODE set to kill suid/sgid.  But to do that client
      needs to know latest mode otherwise it is racy.
      
      To reduce the race window, the current code first calls fuse_do_getattr() to get
      latest ->i_mode and then resets suid/sgid bits and sends rest to server
      with setattr(ATTR_MODE).  This does not reduce the race completely but
      narrows race window significantly.
      
      With fc->handle_killpriv_v2 enabled, it should be possible to remove this
      race completely.  Do not kill suid/sgid with ATTR_MODE at all.  It will be
      killed by server when WRITE request is sent to server soon.  This is
      similar to fc->handle_killpriv logic.  V2 is just more refined version of
      protocol.  Hence this patch does not send ATTR_MODE to kill suid/sgid if
      fc->handle_killpriv_v2 is enabled.
      
      This creates an issue if fc->writeback_cache is enabled.  In that case
      WRITE can be cached in guest and server might not see WRITE request and
      hence will not kill suid/sgid.  Miklos suggested that in such cases, we
      should fall back to a writethrough WRITE instead, which will generate a
      WRITE request and kill suid/sgid.  This patch implements that too.
      
      But this relies on the client seeing the suid/sgid set.  If another client sets
      suid/sgid and this client does not see it immediately, then we will not
      fall back to writethrough WRITE.  So this is one limitation with both
      fc->handle_killpriv_v2 and fc->writeback_cache enabled.  The two options
      are not fully compatible, but this might be good enough for many use cases.
      
      Note: This patch is not checking whether security.capability is set or not
            when falling back to writethrough path.  If suid/sgid is not set and
            only security.capability is set, that will be taken care of by
            file_remove_privs() call in ->writeback_cache path.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: set FUSE_WRITE_KILL_SUIDGID in cached write path · b8667395
      Vivek Goyal committed
      With HANDLE_KILLPRIV_V2, server will need to kill suid/sgid if caller does
      not have CAP_FSETID.  We already have a flag FUSE_WRITE_KILL_SUIDGID in
      WRITE request and we already set it in direct I/O path.
      
      To make it work in cached write path also, start setting
      FUSE_WRITE_KILL_SUIDGID in this path too.
      
      Set it only if fc->handle_killpriv_v2 is set.  Otherwise the client is
      responsible for killing suid/sgid.
      
      In case of direct I/O we set FUSE_WRITE_KILL_SUIDGID unconditionally
      because we don't call file_remove_privs() in that path (with cache=none
      option).
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: rename FUSE_WRITE_KILL_PRIV to FUSE_WRITE_KILL_SUIDGID · 10c52c84
      Miklos Szeredi committed
      Kernel has:
      ATTR_KILL_PRIV -> clear "security.capability"
      ATTR_KILL_SUID -> clear S_ISUID
      ATTR_KILL_SGID -> clear S_ISGID if executable
      
      Fuse has:
      FUSE_WRITE_KILL_PRIV -> clear S_ISUID and S_ISGID if executable
      
      So FUSE_WRITE_KILL_PRIV implies the complement of ATTR_KILL_PRIV, which is
      somewhat confusing.  Also PRIV implies all privileges, including
      "security.capability".
      
      Change the name to FUSE_WRITE_KILL_SUIDGID and make FUSE_WRITE_KILL_PRIV an
      alias to preserve API compatibility.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
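The rename-with-alias pattern from this commit can be illustrated like this. The `DEMO_` names and the bit value are placeholders chosen for the example, not the definitions in the real header.

```c
#include <assert.h>

/* New, more precise name (bit value is an assumption for this sketch). */
#define DEMO_WRITE_KILL_SUIDGID (1u << 2)

/* Old name kept as an alias so code written against it still compiles. */
#define DEMO_WRITE_KILL_PRIV    DEMO_WRITE_KILL_SUIDGID

/* Stands in for an existing caller that was never updated. */
static unsigned int demo_old_caller_flags(void)
{
    return DEMO_WRITE_KILL_PRIV;
}
```

Both names expand to the same bit, so old and new users of the flag produce identical requests.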
    • fuse: launder page should wait for page writeback · 3993382b
      Miklos Szeredi committed
      Qian Cai reports that the WARNING in tree_insert() can be triggered by a
      fuzzer with the following call chain:
      
      invalidate_inode_pages2_range()
         fuse_launder_page()
            fuse_writepage_locked()
               tree_insert()
      
      The reason is that another write for the same page is already queued.
      
      The simplest fix is to wait until the pending write is completed and only
      after that queue the new write.
      
      Since this case is very rare, the additional wait should not be a problem.
      Reported-by: Qian Cai <cai@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  3. 18 Sep, 2020 2 commits
  4. 10 Sep, 2020 3 commits
    • virtiofs: serialize truncate/punch_hole and dax fault path · 6ae330ca
      Vivek Goyal committed
      Currently in fuse we don't seem to have any lock which can serialize the fault
      path with the truncate/punch_hole path. With dax support I need one for the
      following reasons.
      
      1. Dax requirement
      
        DAX fault code relies on inode size being stable for the duration of
        the fault and wants to serialize with truncate/punch_hole; the code
        explicitly mentions it.
      
        static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
                                     const struct iomap_ops *ops)
              /*
               * Check whether offset isn't beyond end of file now. Caller is
               * supposed to hold locks serializing us with truncate / punch hole so
               * this is a reliable test.
               */
              max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
      
      2. Make sure there are no users of pages being truncated/punch_hole
      
        get_user_pages() might take references to page and then do some DMA
        to said pages. Filesystem might truncate those pages without knowing
        that a DMA is in progress or some I/O is in progress. So use
        dax_layout_busy_page() to make sure there are no such references
        and I/O is not in progress on said pages before moving ahead with
        truncation.
      
      3. Limitation of kvm page fault error reporting
      
        If we are truncating file on host first and then removing mappings in
        guest later (truncate page cache etc), then this could lead to a
        problem with KVM. Say a mapping is in place in guest and truncation
        happens on host. Now if guest accesses that mapping, then host will
        take a fault and kvm will either exit to qemu or spin infinitely.
      
        IOW, before we do truncation on host, we need to make sure that guest
        inode does not have any mapping in that region or whole file.
      
      4. virtiofs memory range reclaim
      
       Soon I will introduce the notion of being able to reclaim dax memory
       ranges from a fuse dax inode. There also I need to make sure that
       no I/O or fault is going on in the reclaimed range and nobody is using
       it so that range can be reclaimed without issues.
      
      Currently if we take inode lock, that serializes read/write. But it does
      not do anything for faults. So I add another semaphore fuse_inode->i_mmap_sem
      for this purpose.  It can be used to serialize with faults.
      
      For now, I take this semaphore only in the dax fault path and not in the
      regular fault path, because existing code does not have one.  Maybe
      existing code can benefit from it as well to take care of some races,
      but we can fix that later if need be.  For now, I am focusing only on
      the DAX path, which is new.
      
      Also added logic to take fuse_inode->i_mmap_sem in
      truncate/punch_hole/open(O_TRUNC) path to make sure file truncation and
      fuse dax fault are mutually exclusive and avoid all the above problems.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • virtiofs: add DAX mmap support · 2a9a609a
      Stefan Hajnoczi committed
      Add DAX mmap() support.
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • virtiofs: implement dax read/write operations · c2d0ad00
      Vivek Goyal committed
      This patch implements basic DAX support. mmap() is not implemented
      yet and will come in later patches. This patch only looks into implementing
      read/write.
      
      We make use of interval tree to keep track of per inode dax mappings.
      
      Do not use dax for file extending writes, instead just send WRITE message
      to daemon (like we do for direct I/O path). This will keep write and
      i_size change atomic w.r.t crash.
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  5. 17 Jul, 2020 1 commit
  6. 15 Jul, 2020 1 commit
  7. 14 Jul, 2020 4 commits
    • fuse: don't ignore errors from fuse_writepages_fill() · 7779b047
      Vasily Averin committed
      fuse_writepages() ignores some errors returned by fuse_writepages_fill().  I
      believe it is a bug: if .writepages is called with WB_SYNC_ALL it should
      either guarantee that all data was successfully saved or return an error.
      
      Fixes: 26d614df ("fuse: Implement writepages callback")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: clean up condition for writepage sending · 6ddf3af9
      Miklos Szeredi committed
      fuse_writepages_fill() uses the following construction:
      
      if (wpa && ap->num_pages &&
          (A || B || C)) {
              action;
      } else if (wpa && D) {
              if (E) {
                      the same action;
              }
      }
      
       - ap->num_pages check is always true and can be removed
      
       - "if" and "else if" call the same action and can be merged.
      
      Move checking A, B, C, D, E conditions to a helper, add comments.
      Original-patch-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
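The merge described in this commit can be checked mechanically. The sketch below treats A to E as opaque booleans (their real meaning is not given in the message), assumes ap->num_pages is nonzero as the message states, and uses hypothetical function names throughout.

```c
#include <assert.h>
#include <stdbool.h>

/* Original shape from the commit message; true means "action" runs. */
static bool demo_before(bool wpa, bool num_pages,
                        bool a, bool b, bool c, bool d, bool e)
{
    if (wpa && num_pages && (a || b || c)) {
        return true;
    } else if (wpa && d) {
        if (e)
            return true;
    }
    return false;
}

/* Hypothetical helper collecting the A..E checks in one place. */
static bool demo_need_send(bool a, bool b, bool c, bool d, bool e)
{
    return a || b || c || (d && e);
}

/* Merged shape: a single condition guarding the single action. */
static bool demo_after(bool wpa, bool a, bool b, bool c, bool d, bool e)
{
    return wpa && demo_need_send(a, b, c, d, e);
}

/* Count inputs where the two shapes disagree, with num_pages nonzero. */
static int demo_mismatches(void)
{
    int bad = 0;

    for (unsigned int m = 0; m < 64; m++) {
        bool wpa = m & 1, a = m & 2, b = m & 4;
        bool c = m & 8, d = m & 16, e = m & 32;

        if (demo_before(wpa, true, a, b, c, d, e) !=
            demo_after(wpa, a, b, c, d, e))
            bad++;
    }
    return bad;
}
```

`demo_mismatches()` walks all 64 combinations, so a result of zero means the merged form is equivalent under the stated assumption.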
    • fuse: fix warning in tree_insert() and clean up writepage insertion · c146024e
      Miklos Szeredi committed
      fuse_writepages_fill() calls tree_insert() with ap->num_pages = 0 which
      triggers the following warning:
      
       WARNING: CPU: 1 PID: 17211 at fs/fuse/file.c:1728 tree_insert+0xab/0xc0 [fuse]
       RIP: 0010:tree_insert+0xab/0xc0 [fuse]
       Call Trace:
        fuse_writepages_fill+0x5da/0x6a0 [fuse]
        write_cache_pages+0x171/0x470
        fuse_writepages+0x8a/0x100 [fuse]
        do_writepages+0x43/0xe0
      
      Fix up the warning and clean up the code around rb-tree insertion:
      
       - Rename tree_insert() to fuse_insert_writeback() and make it return the
         conflicting entry in case of failure
      
       - Re-add tree_insert() as a wrapper around fuse_insert_writeback()
      
       - Rename fuse_writepage_in_flight() to fuse_writepage_add() and reverse
         the meaning of the return value to mean
      
          + "true" in case the writepage entry was successfully added
      
          + "false" in case it was in-flight: queued on an existing writepage
             entry's auxiliary list, or the existing writepage entry's temporary
             page was updated
      
         Switch from fuse_find_writeback() + tree_insert() to
         fuse_insert_writeback()
      
       - Move setting orig_pages to before inserting/updating the entry; this may
         result in the orig_pages value being discarded later in case of an
         in-flight request
      
       - In case of a new writepage entry use fuse_writepage_add()
         unconditionally, only set data->wpa if the entry was added.
      
      Fixes: 6b2fb799 ("fuse: optimize writepages search")
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Original-patch-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
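An insert that returns the conflicting entry, as the first bullet above describes, could look roughly like the following. This sketch uses a plain unbalanced binary search tree in place of the kernel rb-tree; `demo_wpa` and `demo_insert()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry covering page indices [first, first + count - 1]. */
struct demo_wpa {
    unsigned long first;
    unsigned long count;
    struct demo_wpa *left, *right;
};

/*
 * Insert wpa into the tree rooted at *root.  Returns NULL on success, or
 * the already-present entry whose page range overlaps the new one.
 */
static struct demo_wpa *demo_insert(struct demo_wpa **root, struct demo_wpa *wpa)
{
    struct demo_wpa **p = root;
    unsigned long from, to;

    /* The message says an insert with zero pages triggered the warning. */
    assert(wpa->count > 0);
    from = wpa->first;
    to = wpa->first + wpa->count - 1;

    while (*p) {
        struct demo_wpa *cur = *p;

        if (from >= cur->first + cur->count)
            p = &cur->right;      /* new range lies entirely after cur */
        else if (to < cur->first)
            p = &cur->left;       /* new range lies entirely before cur */
        else
            return cur;           /* overlap: report the conflict */
    }
    wpa->left = wpa->right = NULL;
    *p = wpa;
    return NULL;
}
```

Returning the conflicting entry lets the caller queue the new request on it in one tree walk, instead of searching first and inserting second.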
    • fuse: move rb_erase() before tree_insert() · 69a6487a
      Miklos Szeredi committed
      In fuse_writepage_end() the old writepages entry needs to be removed from
      the rbtree before inserting the new one, otherwise tree_insert() would
      fail.  This is a very rare codepath and no reproducer exists.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  8. 03 Jun, 2020 1 commit
  9. 20 May, 2020 2 commits
    • fuse: copy_file_range should truncate cache · 9b46418c
      Miklos Szeredi committed
      After the copy operation completes the cache is not up-to-date.  Truncate
      all pages in the interval that has successfully been copied.
      
      Truncating completely copied dirty pages is okay, since the data has been
      overwritten anyway.  Truncating partially copied dirty pages is not okay;
      add a comment for now.
      
      Fixes: 88bc7d50 ("fuse: add support for copy_file_range()")
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: fix copy_file_range cache issues · 2c4656df
      Miklos Szeredi committed
      a) Dirty cache needs to be written back not just in the writeback_cache
      case, since the dirty pages may come from memory maps.
      
      b) The fuse_writeback_range() helper takes an inclusive interval, so the
      end position needs to be pos+len-1 instead of pos+len.
      
      Fixes: 88bc7d50 ("fuse: add support for copy_file_range()")
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
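Point b) is an off-by-one between half-open and inclusive ranges. A small sketch, with a hypothetical 4096-byte page size and hypothetical helper names, shows the effect of passing pos+len where an inclusive end is expected:

```c
#include <assert.h>

#define DEMO_PAGE_SIZE 4096LL   /* assumed page size for the example */

/* Last byte of the half-open range [pos, pos + len), as an inclusive end. */
static long long demo_inclusive_end(long long pos, long long len)
{
    assert(pos >= 0 && len > 0);
    return pos + len - 1;
}

/* Number of pages touched by the inclusive byte range [start, end]. */
static long long demo_pages_touched(long long start, long long end)
{
    assert(start >= 0 && end >= start);
    return end / DEMO_PAGE_SIZE - start / DEMO_PAGE_SIZE + 1;
}
```

For a 4096-byte copy at offset 0, the inclusive end 4095 covers one page, while passing 4096 would drag in a second, unrelated page.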
  10. 19 May, 2020 3 commits
    • fuse: optimize writepages search · 6b2fb799
      Maxim Patlasov committed
      Re-work fi->writepages, replacing list with rb-tree.  This improves
      performance because kernel fuse iterates through fi->writepages for each
      writeback page and typical number of entries is about 800 (for 100MB of
      fuse writeback).
      
      Before patch:
      
      10240+0 records in
      10240+0 records out
      10737418240 bytes (11 GB) copied, 41.3473 s, 260 MB/s
      
       2  1      0 57445400  40416 6323676    0    0    33 374743 8633 19210  1  8 88  3  0
      
        29.86%  [kernel]               [k] _raw_spin_lock
        26.62%  [fuse]                 [k] fuse_page_is_writeback
      
      After patch:
      
      10240+0 records in
      10240+0 records out
      10737418240 bytes (11 GB) copied, 21.4954 s, 500 MB/s
      
       2  9      0 53676040  31744 10265984    0    0    64 854790 10956 48387  1  6 88  6  0
      
        23.55%  [kernel]             [k] copy_user_enhanced_fast_string
         9.87%  [kernel]             [k] __memcpy
         3.10%  [kernel]             [k] _raw_spin_lock
      Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: always flush dirty data on close(2) · 614c026e
      Miklos Szeredi committed
      We want cached data to be synced with the userspace filesystem on close(), for
      example to allow getting correct st_blocks value.  Do this regardless of
      whether the userspace filesystem implements a FLUSH method or not.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: invalidate inode attr in writeback cache mode · cf576c58
      Eryu Guan committed
      Under writeback mode, inode->i_blocks is not updated, so utilities such
      as du read st_blocks as 0.
      
      For example, when using virtiofs (cache=always & nondax mode) with
      writeback_cache enabled, writing a new file and checking its disk usage
      with du reports 0 usage.
      
        # uname -r
        5.6.0-rc6+
        # mount -t virtiofs virtiofs /mnt/virtiofs
        # rm -f /mnt/virtiofs/testfile
      
        # create new file and do extend write
        # xfs_io -fc "pwrite 0 4k" /mnt/virtiofs/testfile
        wrote 4096/4096 bytes at offset 0
        4 KiB, 1 ops; 0.0001 sec (28.103 MiB/sec and 7194.2446 ops/sec)
        # du -k /mnt/virtiofs/testfile
        0               <==== disk usage is 0
        # stat -c %s,%b /mnt/virtiofs/testfile
        4096,0          <==== i_size is correct, but st_blocks is 0
      
      Fix it by invalidating attr in fuse_flush(), so we get up-to-date attr
      from server on next getattr.
      Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  11. 20 Apr, 2020 1 commit
    • virtiofs: schedule blocking async replies in separate worker · bb737bbe
      Vivek Goyal committed
      In virtiofs (unlike in regular fuse) processing of async replies is
      serialized.  This can result in a deadlock in rare corner cases when
      there's a circular dependency between the completion of two or more async
      replies.
      
      Such a deadlock can be reproduced with xfstests:generic/503 if TEST_DIR ==
      SCRATCH_MNT (which is a misconfiguration):
      
       - Process A is waiting for page lock in worker thread context and blocked
         (virtio_fs_requests_done_work()).
       - Process B is holding page lock and waiting for pending writes to
         finish (fuse_wait_on_page_writeback()).
       - Write requests are waiting in virtqueue and can't complete because
         worker thread is blocked on page lock (process A).
      
      Fix this by creating a unique work_struct for each async reply that can
      block (O_DIRECT read).
      
      Fixes: a62a8ef9 ("virtio-fs: add virtiofs filesystem")
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  12. 06 Feb, 2020 3 commits
    • fuse: use true,false for bool variable · cabdb4fa
      zhengbin committed
      Fixes coccicheck warning:
      
      fs/fuse/readdir.c:335:1-19: WARNING: Assignment of 0/1 to bool variable
      fs/fuse/file.c:1398:2-19: WARNING: Assignment of 0/1 to bool variable
      fs/fuse/file.c:1400:2-20: WARNING: Assignment of 0/1 to bool variable
      fs/fuse/cuse.c:454:1-20: WARNING: Assignment of 0/1 to bool variable
      fs/fuse/cuse.c:455:1-19: WARNING: Assignment of 0/1 to bool variable
      fs/fuse/inode.c:497:2-17: WARNING: Assignment of 0/1 to bool variable
      fs/fuse/inode.c:504:2-23: WARNING: Assignment of 0/1 to bool variable
      fs/fuse/inode.c:511:2-22: WARNING: Assignment of 0/1 to bool variable
      fs/fuse/inode.c:518:2-23: WARNING: Assignment of 0/1 to bool variable
      fs/fuse/inode.c:522:2-26: WARNING: Assignment of 0/1 to bool variable
      fs/fuse/inode.c:526:2-18: WARNING: Assignment of 0/1 to bool variable
      fs/fuse/inode.c:1000:1-20: WARNING: Assignment of 0/1 to bool variable
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fuse: don't overflow LLONG_MAX with end offset · 2f139829
      Miklos Szeredi committed
      Handle the special case of fuse_readpages() wanting to read the last page
      of the largest possible file and overflowing the end offset in the process.
      
      This is basically to unbreak xfstests:generic/525 and prevent filesystems
      from doing bad things with an overflowing offset.
      Reported-by: Xiao Yang <ice_yangxiao@163.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
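One generic way to keep an end offset from overflowing is to compare against the remaining headroom before adding. The sketch below shows that technique with a hypothetical `demo_end_offset()` helper; it is not necessarily how the commit itself handles the case.

```c
#include <assert.h>
#include <limits.h>

/*
 * Inclusive end offset of a read of len bytes starting at pos, clamped to
 * LLONG_MAX.  The comparison is done on the headroom (LLONG_MAX - pos), so
 * no intermediate value can overflow.
 */
static long long demo_end_offset(long long pos, long long len)
{
    assert(pos >= 0 && len > 0);
    if (len - 1 > LLONG_MAX - pos)
        return LLONG_MAX;
    return pos + (len - 1);
}
```

Reading the last 4096 bytes of a file of maximal size ends exactly at LLONG_MAX, and a request that would run past it is clamped instead of wrapping negative.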
    • fix up iter on short count in fuse_direct_io() · f658adee
      Miklos Szeredi committed
      fuse_direct_io() can end up advancing the iterator by more than the amount
      of data read or written.  This case is handled by the generic code if going
      through ->direct_IO(), but not in the FOPEN_DIRECT_IO case.
      
      Fix by reverting the extra bytes from the iterator in case of error or a
      short count.
      
      To test: install lxcfs, then the following testcase
        int fd = open("/var/lib/lxcfs/proc/uptime", O_RDONLY);
        sendfile(1, fd, NULL, 16777216);
        sendfile(1, fd, NULL, 16777216);
      will spew WARN_ON() in iov_iter_pipe().
      Reported-by: Peter Geis <pgwipeout@gmail.com>
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Fixes: 3c3db095 ("fuse: use iov_iter based generic splice helpers")
      Cc: <stable@vger.kernel.org> # v5.1
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
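The "revert the extra bytes" fix can be pictured with a toy iterator that only tracks how many bytes remain. `demo_iter`, `demo_advance()`, `demo_revert()` and `demo_direct_io()` are hypothetical stand-ins, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Toy iterator: just a count of bytes still available. */
struct demo_iter {
    size_t remaining;
};

static void demo_advance(struct demo_iter *it, size_t n)
{
    assert(n <= it->remaining);
    it->remaining -= n;
}

static void demo_revert(struct demo_iter *it, size_t n)
{
    it->remaining += n;
}

/*
 * The iterator is advanced by the full request (want) up front, but only
 * done bytes are actually transferred.  On a short count the unused part
 * is handed back so the iterator matches the returned byte count.
 */
static size_t demo_direct_io(struct demo_iter *it, size_t want, size_t done)
{
    assert(done <= want);
    demo_advance(it, want);
    if (done < want)
        demo_revert(it, want - done);
    return done;
}
```

Without the revert step, a 64-byte request that transferred 10 bytes would leave the iterator 54 bytes ahead of the data, which is the mismatch the commit message describes.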
  13. 16 Jan, 2020 1 commit
    • fuse: fix fuse_send_readpages() in the synchronous read case · 7df1e988
      Miklos Szeredi committed
      Buffered read in fuse normally goes via:
      
       -> generic_file_buffered_read()
         -> fuse_readpages()
           -> fuse_send_readpages()
             ->fuse_simple_request() [called since v5.4]
      
      In the case of a read request, fuse_simple_request() will return a
      non-negative bytecount on success or a negative error value.  A positive
      bytecount was taken to be an error and the PG_error flag set on the page.
      This resulted in generic_file_buffered_read() falling back to ->readpage(),
      which would repeat the read request and succeed.  Because of the repeated
      read succeeding the bug was not detected with regression tests or other use
      cases.
      
      The FTP module in GVFS however fails the second read due to the
      non-seekable nature of FTP downloads.
      
      Fix by checking and ignoring positive return value from
      fuse_simple_request().
      Reported-by: Ondrej Holy <oholy@redhat.com>
      Link: https://gitlab.gnome.org/GNOME/gvfs/issues/441
      Fixes: 134831e3 ("fuse: convert readpages to simple api")
      Cc: <stable@vger.kernel.org> # v5.4
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
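The return-value convention at the heart of this bug (negative means error, non-negative means a byte count) can be sketched as follows. `demo_read_error()` is a hypothetical helper written for this illustration.

```c
#include <assert.h>
#include <errno.h>

/*
 * ret follows the convention described above: a negative errno value on
 * failure, or a non-negative byte count on success.  Only the negative
 * case is an error; a positive count must not be treated as one.
 */
static int demo_read_error(long ret)
{
    return ret < 0 ? (int)ret : 0;
}
```

A check such as `if (ret)` would flag a successful 4096-byte read as a failure, whereas `demo_read_error(4096)` yields 0.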
  14. 27 Nov, 2019 1 commit
  15. 12 Nov, 2019 1 commit
  16. 23 Oct, 2019 2 commits
  17. 24 Sep, 2019 2 commits
  18. 10 Sep, 2019 6 commits