1. 27 10月, 2010 1 次提交
  2. 04 10月, 2010 1 次提交
  3. 07 9月, 2010 2 次提交
    • M
      fuse: fix lock annotations · b9ca67b2
      Miklos Szeredi 提交于
      Sparse doesn't understand lock annotations of the form
      __releases(&foo->lock).  Change them to __releases(foo->lock).  Same
      for __acquires().
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      b9ca67b2
    • M
      fuse: flush background queue on connection close · 595afaf9
      Miklos Szeredi 提交于
      David Bartly reported that fuse can hang in fuse_get_req_nofail() when
      the connection to the filesystem server is no longer active.
      
      If bg_queue is not empty then flush_bg_queue() called from
      request_end() can put more requests on to the pending queue.  If this
      happens while ending requests on the processing queue then those
      background requests will be queued to the pending list and never
      ended.
      
      Another problem is that fuse_dev_release() didn't wake up processes
      sleeping on blocked_waitq.
      
      Solve this by:
      
       a) flushing the background queue before calling end_requests() on the
          pending and processing queues
      
       b) setting blocked = 0 and waking up processes waiting on
          blocked_waitq()
      
      Thanks to David for an excellent bug report.
      Reported-by: NDavid Bartley <andareed@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      CC: stable@kernel.org
      595afaf9
  4. 12 7月, 2010 3 次提交
    • M
      fuse: add retrieve request · 2d45ba38
      Miklos Szeredi 提交于
      Userspace filesystem can request data to be retrieved from the inode's
      mapping.  This request is synchronous and the retrieved data is queued
      as a new request.  If the write to the fuse device returns an error
      then the retrieve request was not completed and a reply will not be
      sent.
      
      Only present pages are returned in the retrieve reply.  Retrieving
      stops when it finds a non-present page and only data prior to that is
      returned.
      
      This request doesn't change the dirty state of pages.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      2d45ba38
    • M
      fuse: add store request · a1d75f25
      Miklos Szeredi 提交于
      Userspace filesystem can request data to be stored in the inode's
      mapping.  This request is synchronous and has no reply.  If the write
      to the fuse device returns an error then the store request was not
      fully completed (but may have updated some pages).
      
      If the stored data overflows the current file size, then the size is
      extended, similarly to a write(2) on the filesystem.
      
      Pages which have been completely stored are marked uptodate.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      a1d75f25
    • M
      fuse: don't use atomic kmap · 7909b1c6
      Miklos Szeredi 提交于
      Don't use atomic kmap for mapping userspace buffers in device
      read/write/splice.
      
      This is necessary because the next patch (adding store notify)
      requires that caller of fuse_copy_page() may sleep between
      invocations.  The simplest way to ensure this is to change the atomic
      kmaps to non-atomic ones.
      
      Thankfully architectures where kmap() is not a no-op are going out of
      fashion, so we can ignore the (probably negligible) performance impact
      of this change.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      7909b1c6
  5. 26 5月, 2010 1 次提交
    • K
      driver core: add devname module aliases to allow module on-demand auto-loading · 578454ff
      Kay Sievers 提交于
      This adds:
        alias: devname:<name>
      to some common kernel modules, which will allow the on-demand loading
      of the kernel module when the device node is accessed.
      
      Ideally all these modules would be compiled-in, but distros seems too
      much in love with their modularization that we need to cover the common
      cases with this new facility. It will allow us to remove a bunch of pretty
      useless init scripts and modprobes from init scripts.
      
      The static device node aliases will be carried in the module itself. The
      program depmod will extract this information to a file in the module directory:
        $ cat /lib/modules/2.6.34-00650-g537b60d1-dirty/modules.devname
        # Device nodes to trigger on-demand module loading.
        microcode cpu/microcode c10:184
        fuse fuse c10:229
        ppp_generic ppp c108:0
        tun net/tun c10:200
        dm_mod mapper/control c10:235
      
      Udev will pick up the depmod created file on startup and create all the
      static device nodes which the kernel modules specify, so that these modules
      get automatically loaded when the device node is accessed:
        $ /sbin/udevd --debug
        ...
        static_dev_create_from_modules: mknod '/dev/cpu/microcode' c10:184
        static_dev_create_from_modules: mknod '/dev/fuse' c10:229
        static_dev_create_from_modules: mknod '/dev/ppp' c108:0
        static_dev_create_from_modules: mknod '/dev/net/tun' c10:200
        static_dev_create_from_modules: mknod '/dev/mapper/control' c10:235
        udev_rules_apply_static_dev_perms: chmod '/dev/net/tun' 0666
        udev_rules_apply_static_dev_perms: chmod '/dev/fuse' 0666
      
      A few device nodes are switched to statically allocated numbers, to allow
      the static nodes to work. This might also useful for systems which still run
      a plain static /dev, which is completely unsafe to use with any dynamic minor
      numbers.
      
      Note:
      The devname aliases must be limited to the *common* and *single*instance*
      device nodes, like the misc devices, and never be used for conceptually limited
      systems like the loop devices, which should rather get fixed properly and get a
      control node for losetup to talk to, instead of creating a random number of
      device nodes in advance, regardless if they are ever used.
      
      This facility is to hide the mess distros are creating with too modualized
      kernels, and just to hide that these modules are not compiled-in, and not to
      paper-over broken concepts. Thanks! :)
      
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Alasdair G Kergon <agk@redhat.com>
      Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
      Cc: Ian Kent <raven@themaw.net>
      Signed-Off-By: NKay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
      578454ff
  6. 25 5月, 2010 4 次提交
    • M
      fuse: support splice() reading from fuse device · c3021629
      Miklos Szeredi 提交于
      Allow userspace filesystem implementation to use splice() to read from
      the fuse device.
      
      The userspace filesystem can now transfer data coming from a WRITE
      request to an arbitrary file descriptor (regular file, block device or
      socket) without having to go through a userspace buffer.
      
      The semantics of using splice() to read messages are:
      
       1)  with a single splice() call move the whole message from the fuse
           device to a temporary pipe
       2)  read the header from the pipe and determine the message type
       3a) if message is a WRITE then splice data from pipe to destination
       3b) else read rest of message to userspace buffer
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      c3021629
    • M
      fuse: allow splice to move pages · ce534fb0
      Miklos Szeredi 提交于
      When splicing buffers to the fuse device with SPLICE_F_MOVE, try to
      move pages from the pipe buffer into the page cache.  This allows
      populating the fuse filesystem's cache without ever touching the page
      contents, i.e. zero copy read capability.
      
      The following steps are performed when trying to move a page into the
      page cache:
      
       - buf->ops->confirm() to make sure the new page is uptodate
       - buf->ops->steal() to try to remove the new page from it's previous place
       - remove_from_page_cache() on the old page
       - add_to_page_cache_locked() on the new page
      
      If any of the above steps fail (non fatally) then the code falls back
      to copying the page.  In particular ->steal() will fail if there are
      external references (other than the page cache and the pipe buffer) to
      the page.
      
      Also since the remove_from_page_cache() + add_to_page_cache_locked()
      are non-atomic it is possible that the page cache is repopulated in
      between the two and add_to_page_cache_locked() will fail.  This could
      be fixed by creating a new atomic replace_page_cache_page() function.
      
      fuse_readpages_end() needed to be reworked so it works even if
      page->mapping is NULL for some or all pages which can happen if the
      add_to_page_cache_locked() failed.
      
      A number of sanity checks were added to make sure the stolen pages
      don't have weird flags set, etc...  These could be moved into generic
      splice/steal code.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      ce534fb0
    • M
      fuse: support splice() writing to fuse device · dd3bb14f
      Miklos Szeredi 提交于
      Allow userspace filesystem implementation to use splice() to write to
      the fuse device.  The semantics of using splice() are:
      
       1) buffer the message header and data in a temporary pipe
       2) with a *single* splice() call move the message from the temporary pipe
          to the fuse device
      
      The READ reply message has the most interesting use for this, since
      now the data from an arbitrary file descriptor (which could be a
      regular file, a block device or a socket) can be tranferred into the
      fuse device without having to go through a userspace buffer.  It will
      also allow zero copy moving of pages.
      
      One caveat is that the protocol on the fuse device requires the length
      of the whole message to be written into the header.  But the length of
      the data transferred into the temporary pipe may not be known in
      advance.  The current library implementation works around this by
      using vmplice to write the header and modifying the header after
      splicing the data into the pipe (error handling omitted):
      
      	struct fuse_out_header out;
      
      	iov.iov_base = &out;
      	iov.iov_len = sizeof(struct fuse_out_header);
      	vmsplice(pip[1], &iov, 1, 0);
      	len = splice(input_fd, input_offset, pip[1], NULL, len, 0);
      	/* retrospectively modify the header: */
      	out.len = len + sizeof(struct fuse_out_header);
      	splice(pip[0], NULL, fuse_chan_fd(req->ch), NULL, out.len, flags);
      
      This works since vmsplice only saves a pointer to the data, it does
      not copy the data itself.
      
      Since pipes are currently limited to 16 pages and messages need to be
      spliced atomically, the length of the data is limited to 15 pages (or
      60kB for 4k pages).
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      dd3bb14f
    • M
      fuse: use get_user_pages_fast() · 1bf94ca7
      Miklos Szeredi 提交于
      Replace uses of get_user_pages() with get_user_pages_fast().  It looks
      nicer and should be faster in most cases.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      1bf94ca7
  7. 05 2月, 2010 2 次提交
  8. 12 7月, 2009 1 次提交
  9. 11 7月, 2009 2 次提交
  10. 07 7月, 2009 1 次提交
  11. 01 7月, 2009 2 次提交
    • J
      fuse: invalidation reverse calls · 3b463ae0
      John Muir 提交于
      Add notification messages that allow the filesystem to invalidate VFS
      caches.
      
      Two notifications are added:
      
       1) inode invalidation
      
         - invalidate cached attributes
         - invalidate a range of pages in the page cache (this is optional)
      
       2) dentry invalidation
      
         - try to invalidate a subtree in the dentry cache
      
      Care must be taken while accessing the 'struct super_block' for the
      mount, as it can go away while an invalidation is in progress.  To
      prevent this, introduce a rw-semaphore, that is taken for read during
      the invalidation and taken for write in the ->kill_sb callback.
      
      Cc: Csaba Henk <csaba@gluster.com>
      Cc: Anand Avati <avati@zresearch.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      3b463ae0
    • C
      fuse: fix return value of fuse_dev_write() · b4c458b3
      Csaba Henk 提交于
      On 64 bit systems -- where sizeof(ssize_t) > sizeof(int) -- the following test
      exposes a bug due to a non-careful return of an int or unsigned value:
      
      implement a FUSE filesystem which sends an unsolicited notification to
      the kernel with invalid opcode. The respective write to /dev/fuse
      will return (1 << 32) - EINVAL with errno == 0 instead of -1 with
      errno == EINVAL.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      CC: stable@kernel.org
      b4c458b3
  12. 28 4月, 2009 2 次提交
  13. 26 1月, 2009 2 次提交
    • M
      fuse: fix poll notify · f6d47a17
      Miklos Szeredi 提交于
      Move fuse_copy_finish() to before calling fuse_notify_poll_wakeup().
      This is not a big issue because fuse_notify_poll_wakeup() should be
      atomic, but it's cleaner this way, and later uses of notification will
      need to be able to finish the copying before performing some actions.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      f6d47a17
    • M
      fuse: destroy bdi on umount · 26c36791
      Miklos Szeredi 提交于
      If a fuse filesystem is unmounted but the device file descriptor
      remains open and a new mount reuses the old device number, then the
      mount fails with EEXIST and the following warning is printed in the
      kernel log:
      
        WARNING: at fs/sysfs/dir.c:462 sysfs_add_one+0x35/0x3d()
        sysfs: duplicate filename '0:15' can not be created
      
      The cause is that the bdi belonging to the fuse filesystem was
      destoryed only after the device file was released.  Fix this by
      calling bdi_destroy() from fuse_put_super() instead.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      CC: stable@kernel.org
      26c36791
  14. 02 12月, 2008 1 次提交
  15. 26 11月, 2008 5 次提交
  16. 14 11月, 2008 1 次提交
  17. 02 11月, 2008 1 次提交
    • A
      saner FASYNC handling on file close · 233e70f4
      Al Viro 提交于
      As it is, all instances of ->release() for files that have ->fasync()
      need to remember to evict file from fasync lists; forgetting that
      creates a hole and we actually have a bunch that *does* forget.
      
      So let's keep our lives simple - let __fput() check FASYNC in
      file->f_flags and call ->fasync() there if it's been set.  And lose that
      crap in ->release() instances - leaving it there is still valid, but we
      don't have to bother anymore.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      233e70f4
  18. 30 4月, 2008 2 次提交
    • M
      fuse: fix sparse warnings · 4dbf930e
      Miklos Szeredi 提交于
      fs/fuse/dev.c:306:2: warning: context imbalance in 'wait_answer_interruptible' - unexpected unlock
      fs/fuse/dev.c:361:2: warning: context imbalance in 'request_wait_answer' - unexpected unlock
      fs/fuse/dev.c:1002:4: warning: context imbalance in 'end_io_requests' - unexpected unlock
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4dbf930e
    • M
      fuse: support writable mmap · 3be5a52b
      Miklos Szeredi 提交于
      Quoting Linus (3 years ago, FUSE inclusion discussions):
      
        "User-space filesystems are hard to get right. I'd claim that they
         are almost impossible, unless you limit them somehow (shared
         writable mappings are the nastiest part - if you don't have those,
         you can reasonably limit your problems by limiting the number of
         dirty pages you accept through normal "write()" calls)."
      
      Instead of attempting the impossible, I've just waited for the dirty page
      accounting infrastructure to materialize (thanks to Peter Zijlstra and
      others).  This nicely solved the biggest problem: limiting the number of pages
      used for write caching.
      
      Some small details remained, however, which this largish patch attempts to
      address.  It provides a page writeback implementation for fuse, which is
      completely safe against VM related deadlocks.  Performance may not be very
      good for certain usage patterns, but generally it should be acceptable.
      
      It has been tested extensively with fsx-linux and bash-shared-mapping.
      
      Fuse page writeback design
      --------------------------
      
      fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
      It copies the contents of the original page, and queues a WRITE request to the
      userspace filesystem using this temp page.
      
      The writeback is finished instantly from the MM's point of view: the page is
      removed from the radix trees, and the PageDirty and PageWriteback flags are
      cleared.
      
      For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
      incremented.  The per-bdi writeback count is not decremented until the actual
      write completes.
      
      On dirtying the page, fuse waits for a previous write to finish before
      proceeding.  This makes sure, there can only be one temporary page used at a
      time for one cached page.
      
      This approach is wasteful in both memory and CPU bandwidth, so why is this
      complication needed?
      
      The basic problem is that there can be no guarantee about the time in which
      the userspace filesystem will complete a write.  It may be buggy or even
      malicious, and fail to complete WRITE requests.  We don't want unrelated parts
      of the system to grind to a halt in such cases.
      
      Also a filesystem may need additional resources (particularly memory) to
      complete a WRITE request.  There's a great danger of a deadlock if that
      allocation may wait for the writepage to finish.
      
      Currently there are several cases where the kernel can block on page
      writeback:
      
        - allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
        - page migration
        - throttle_vm_writeout (through NR_WRITEBACK)
        - sync(2)
      
      Of course in some cases (fsync, msync) we explicitly want to allow blocking.
      So for these cases new code has to be added to fuse, since the VM is not
      tracking writeback pages for us any more.
      
      As an extra safetly measure, the maximum dirty ratio allocated to a single
      fuse filesystem is set to 1% by default.  This way one (or several) buggy or
      malicious fuse filesystems cannot slow down the rest of the system by hogging
      dirty memory.
      
      With appropriate privileges, this limit can be raised through
      '/sys/class/bdi/<bdi>/max_ratio'.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3be5a52b
  19. 07 2月, 2008 1 次提交
  20. 17 10月, 2007 5 次提交