1. 16 Nov, 2019 1 commit
    • D
      pipe: Allow pipes to have kernel-reserved slots · 6718b6f8
      David Howells authored
      Split pipe->ring_size into two numbers:
      
       (1) pipe->ring_size - indicates the hard size of the pipe ring.
      
       (2) pipe->max_usage - indicates the maximum number of pipe ring slots that
           userspace orchestrated events can fill.
      
      This allows for a pipe that is both writable by the general kernel
      notification facility and by userspace, allowing plenty of ring space for
      notifications to be added whilst preventing userspace from being able to
      pin too much unswappable kernel space.
      Signed-off-by: David Howells <dhowells@redhat.com>
      6718b6f8
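The split between the two limits can be sketched as a small userspace model. This is only an illustration of the idea: `struct ring_model`, `user_may_add()` and `kernel_may_add()` are hypothetical names, not the kernel's actual structures or helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the two limits; not the kernel's struct. */
struct ring_model {
    unsigned int head;      /* next slot to be produced into */
    unsigned int tail;      /* next slot to be consumed */
    unsigned int ring_size; /* hard size of the ring */
    unsigned int max_usage; /* slots that userspace-driven writes may fill */
};

/* Number of slots currently in use. */
static unsigned int model_occupancy(const struct ring_model *r)
{
    return r->head - r->tail;
}

/* Userspace-driven writes stop once max_usage slots are occupied. */
static bool user_may_add(const struct ring_model *r)
{
    assert(r->max_usage <= r->ring_size);
    return model_occupancy(r) < r->max_usage;
}

/* Kernel-generated notifications may use the whole ring. */
static bool kernel_may_add(const struct ring_model *r)
{
    return model_occupancy(r) < r->ring_size;
}
```

With `ring_size = 16` and `max_usage = 4`, a ring holding four buffers refuses further userspace writes while still leaving twelve slots for notifications.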
  2. 31 Oct, 2019 1 commit
    • D
      pipe: Use head and tail pointers for the ring, not cursor and length · 8cefc107
      David Howells authored
      Convert pipes to use head and tail pointers for the buffer ring rather than
      pointer and length as the latter requires two atomic ops to update (or a
      combined op) whereas the former only requires one.
      
       (1) The head pointer is the point at which production occurs and points to
           the slot in which the next buffer will be placed.  This is equivalent
           to pipe->curbuf + pipe->nrbufs.
      
           The head pointer belongs to the write-side.
      
       (2) The tail pointer is the point at which consumption occurs.  It points
           to the next slot to be consumed.  This is equivalent to pipe->curbuf.
      
           The tail pointer belongs to the read-side.
      
       (3) head and tail are allowed to run to UINT_MAX and wrap naturally.  They
           are only masked off when the array is being accessed, e.g.:
      
      	pipe->bufs[head & mask]
      
           This means that it is not necessary to have a dead slot in the ring as
           head == tail isn't ambiguous.
      
       (4) The ring is empty if "head == tail".
      
           A helper, pipe_empty(), is provided for this.
      
       (5) The occupancy of the ring is "head - tail".
      
           A helper, pipe_occupancy(), is provided for this.
      
       (6) The number of free slots in the ring is "pipe->ring_size - occupancy".
      
           A helper, pipe_space_for_user() is provided to indicate how many slots
           userspace may use.
      
       (7) The ring is full if "head - tail >= pipe->ring_size".
      
           A helper, pipe_full(), is provided for this.
      Signed-off-by: David Howells <dhowells@redhat.com>
      8cefc107
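The head/tail arithmetic described in points (3) to (7) can be sketched in plain C. The `ring_*` helpers and the `RING_SIZE` constant below are hypothetical stand-ins written for illustration, assuming a power-of-two ring size; they are not the kernel's `pipe_empty()`, `pipe_occupancy()` or `pipe_full()`.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 8u                 /* assumed to be a power of two */
#define RING_MASK (RING_SIZE - 1u)

/* The ring is empty when the two free-running counters are equal. */
static bool ring_empty(unsigned int head, unsigned int tail)
{
    return head == tail;
}

/* Unsigned subtraction stays correct even after head has wrapped. */
static unsigned int ring_occupancy(unsigned int head, unsigned int tail)
{
    return head - tail;
}

/* The ring is full once every slot is occupied. */
static bool ring_full(unsigned int head, unsigned int tail)
{
    return ring_occupancy(head, tail) >= RING_SIZE;
}

/* Counters are masked only when indexing the buffer array. */
static unsigned int ring_slot(unsigned int index)
{
    assert((RING_SIZE & RING_MASK) == 0); /* power-of-two check */
    return index & RING_MASK;
}
```

Because the counters are never masked in place, a full ring (`head - tail == RING_SIZE`) and an empty ring (`head == tail`) remain distinguishable, which is why no dead slot is needed.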
  3. 24 Sep, 2019 2 commits
  4. 12 Sep, 2019 6 commits
  5. 10 Sep, 2019 17 commits
    • M
      fuse: stop copying pages to fuse_req · 05ea48cc
      Miklos Szeredi authored
      The page array pointers are also duplicated across fuse_args_pages and
      fuse_req.  Get rid of the fuse_req ones.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      05ea48cc
    • M
      fuse: stop copying args to fuse_req · d4993774
      Miklos Szeredi authored
      No need to duplicate the argument arrays in fuse_req, so just dereference
      req->args instead of copying to the fuse_req internal ones.
      
      This allows further cleanup of the fuse_req structure.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      d4993774
    • M
      fuse: simplify request allocation · 7213394c
      Miklos Szeredi authored
      Page arrays are not allocated together with the request anymore.  Get rid
      of the dead code.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      7213394c
    • M
      fuse: unexport request ops · 66abc359
      Miklos Szeredi authored
      All requests are now sent with one of the fuse_simple_... helpers.  Get rid
      of the old API from the fuse internal header.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      66abc359
    • M
      fuse: convert retrieve to simple api · 75b399dd
      Miklos Szeredi authored
      Rename fuse_request_send_notify_reply() to fuse_simple_notify_reply() and
      convert to passing fuse_args instead of fuse_req.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      75b399dd
    • M
      fuse: convert writepages to simple api · 33826ebb
      Miklos Szeredi authored
      Derive fuse_writepage_args from fuse_io_args.
      
      Sending the request is tricky since it was done with fi->lock held, hence
      we must either use atomic allocation or release the lock.  Both are
      possible so try atomic first and if it fails, release the lock and do the
      regular allocation with GFP_NOFS and __GFP_NOFAIL.  Both flags are
      necessary for correct operation.
      
      Move the page realloc function from dev.c to file.c and convert to using
      fuse_writepage_args.
      
      The last caller of fuse_write_fill() is gone, so get rid of it.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      33826ebb
    • M
      fuse: add simple background helper · 12597287
      Miklos Szeredi authored
      Create a helper named fuse_simple_background() that is similar to
      fuse_simple_request().  Unlike the latter, it returns immediately and calls
      the supplied 'end' callback when the reply is received.
      
      The supplied 'args' pointer is stored in 'fuse_req' which allows the
      callback to interpret the output arguments decoded from the reply.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      12597287
    • M
      fuse: convert ioctl to simple api · 093f38a2
      Miklos Szeredi authored
      fuse_simple_request() is converted to return the length of the last
      (instead of the single) out arg, since FUSE_IOCTL_OUT has two out args, the
      second of which is variable length.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      093f38a2
    • M
      fuse: move page alloc · 4c4f03f7
      Miklos Szeredi authored
      fuse_req_pages_alloc() is moved to file.c, since its internal use by the
      device code will eventually be removed.
      
      Rename to fuse_pages_alloc() to signify that it is usable for more than
      just the fuse_req page array.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      4c4f03f7
    • M
      fuse: add pages to fuse_args · 68583165
      Miklos Szeredi authored
      Derive fuse_args_pages from fuse_args. This is used to handle requests
      which use pages for input or output.  The related flags are added to
      fuse_args.
      
      A new FR_ALLOC_PAGES flag is added to indicate whether the page arrays in
      fuse_req need to be freed by fuse_put_request() or not.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      68583165
    • M
      fuse: add nocreds to fuse_args · e413754b
      Miklos Szeredi authored
      In some cases it makes no sense to set pid/uid/gid fields in the request
      header.  Allow fuse_simple_background() to omit these.  This is only
      required in the "force" case, so for now just WARN if set otherwise.
      
      Fold fuse_get_req_nofail_nopages() into its only caller.  The comment is
      obsolete anyway.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      e413754b
    • M
      fuse: convert fuse_force_forget() to simple api · 3545fe21
      Miklos Szeredi authored
      Move this function to readdir.c, where its only caller resides.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      3545fe21
    • M
      fuse: add noreply to fuse_args · 454a7613
      Miklos Szeredi authored
      This will be used by fuse_force_forget().
      
      We can expand fuse_request_send() into fuse_simple_request().  The
      FR_WAITING bit has already been set, so there is no need to check it.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      454a7613
    • M
      fuse: convert flush to simple api · c500ebaa
      Miklos Szeredi authored
      Add 'force' to fuse_args and use fuse_get_req_nofail_nopages() to allocate
      the request in that case.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      c500ebaa
    • M
      fuse: simplify 'nofail' request · 40ac7ab2
      Miklos Szeredi authored
      Instead of complex games with a reserved request, just use __GFP_NOFAIL.
      
      Both callers (flush, readdir) guarantee that the connection was already
      initialized, so no need to wait for fc->initialized.
      
      Also remove unneeded clearing of FR_BACKGROUND flag.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      40ac7ab2
    • M
      fuse: flatten 'struct fuse_args' · d5b48543
      Miklos Szeredi authored
      ...to make future expansion simpler.  The hierarchical structure is a
      historical thing that does not serve any practical purpose.
      
      The generated code is exactly the same before and after the patch.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      d5b48543
    • E
      fuse: fix deadlock with aio poll and fuse_iqueue::waitq.lock · 76e43c8c
      Eric Biggers authored
      When IOCB_CMD_POLL is used on the FUSE device, aio_poll() disables IRQs
      and takes kioctx::ctx_lock, then fuse_iqueue::waitq.lock.
      
      This may have to wait for fuse_iqueue::waitq.lock to be released by one
      of many places that take it with IRQs enabled.  Since the IRQ handler
      may take kioctx::ctx_lock, lockdep reports that a deadlock is possible.
      
      Fix it by protecting the state of struct fuse_iqueue with a separate
      spinlock, and only accessing fuse_iqueue::waitq using the versions of
      the waitqueue functions which do IRQ-safe locking internally.
      
      Reproducer:
      
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <sys/mount.h>
      	#include <sys/stat.h>
      	#include <sys/syscall.h>
      	#include <unistd.h>
      	#include <linux/aio_abi.h>
      
      	int main()
      	{
      		char opts[128];
      		int fd = open("/dev/fuse", O_RDWR);
      		aio_context_t ctx = 0;
      		struct iocb cb = { .aio_lio_opcode = IOCB_CMD_POLL, .aio_fildes = fd };
      		struct iocb *cbp = &cb;
      
      		sprintf(opts, "fd=%d,rootmode=040000,user_id=0,group_id=0", fd);
      		mkdir("mnt", 0700);
      		mount("foo",  "mnt", "fuse", 0, opts);
      		syscall(__NR_io_setup, 1, &ctx);
      		syscall(__NR_io_submit, ctx, 1, &cbp);
      	}
      
      Beginning of lockdep output:
      
      	=====================================================
      	WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
      	5.3.0-rc5 #9 Not tainted
      	-----------------------------------------------------
      	syz_fuse/135 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
      	000000003590ceda (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:338 [inline]
      	000000003590ceda (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1751 [inline]
      	000000003590ceda (&fiq->waitq){+.+.}, at: __io_submit_one.constprop.0+0x203/0x5b0 fs/aio.c:1825
      
      	and this task is already holding:
      	0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:363 [inline]
      	0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1749 [inline]
      	0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one.constprop.0+0x1f4/0x5b0 fs/aio.c:1825
      	which would create a new lock dependency:
      	 (&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.}
      
      	but this new dependency connects a SOFTIRQ-irq-safe lock:
      	 (&(&ctx->ctx_lock)->rlock){..-.}
      
      	[...]
      
      Reported-by: syzbot+af05535bb79520f95431@syzkaller.appspotmail.com
      Reported-by: syzbot+d86c4426a01f60feddc7@syzkaller.appspotmail.com
      Fixes: bfe4037e ("aio: implement IOCB_CMD_POLL")
      Cc: <stable@vger.kernel.org> # v4.19+
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      76e43c8c
  6. 02 Sep, 2019 1 commit
    • K
      fuse: require /dev/fuse reads to have enough buffer capacity (take 2) · 1fb027d7
      Kirill Smelkov authored
      [ This retries commit d4b13963 ("fuse: require /dev/fuse reads to have
      enough buffer capacity"), which was reverted.  In this version we require
      only `sizeof(fuse_in_header) + sizeof(fuse_write_in)` instead of 4K for
      FUSE request header room, because, contrary to libfuse and kernel client
      behaviour, GlusterFS actually provides only so much room for request
      header. ]
      
      A FUSE filesystem server queues /dev/fuse sys_read calls to get filesystem
      requests to handle. It does not know in advance what would be that request
      as it can be anything that client issues - LOOKUP, READ, WRITE, ... Many
      requests are short and retrieve data from the filesystem. However WRITE and
      NOTIFY_REPLY write data into filesystem.
      
      Before getting into operation phase, FUSE filesystem server and kernel
      client negotiate what should be the maximum write size the client will ever
      issue. After negotiation the contract in between server/client is that the
      filesystem server then should queue /dev/fuse sys_read calls with enough
      buffer capacity to receive any client request - WRITE in particular, while
      FUSE client should not, in particular, send WRITE requests with >
      negotiated max_write payload. FUSE client in kernel and libfuse
      historically reserve 4K for request header. However an existing filesystem
      server - GlusterFS - was found which reserves only 80 bytes for header room
      (= `sizeof(fuse_in_header) + sizeof(fuse_write_in)`).
      
      Since
      
      	`sizeof(fuse_in_header) + sizeof(fuse_write_in)` ==
      	`sizeof(fuse_in_header) + sizeof(fuse_read_in)`  ==
      	`sizeof(fuse_in_header) + sizeof(fuse_notify_retrieve_in)`
      
      is the absolute minimum any sane filesystem should be using for header
      room, the contract is that filesystem server should queue sys_reads with
      `sizeof(fuse_in_header) + sizeof(fuse_write_in)` + max_write buffer.
      
      If the filesystem server does not follow this contract, what can happen
      is that fuse_dev_do_read will see that request size is > buffer size,
      and then it will return EIO to client who issued the request but won't
      indicate in any way that there is a problem to filesystem server.
      This can be hard to diagnose because for some requests, e.g. for
      NOTIFY_REPLY which mimics WRITE, there is no client thread that is
      waiting for request completion and that EIO goes nowhere, while on
      filesystem server side things look like the kernel is not replying back
      after successful NOTIFY_RETRIEVE request made by the server.
      
      We can make the problem easy to diagnose if we indicate via error return to
      filesystem server when it is violating the contract.  This should not
      practically cause problems because if a filesystem server is using shorter
      buffer, writes to it were already very likely to cause EIO, and if the
      filesystem is read-only it should be too following FUSE_MIN_READ_BUFFER
      minimum buffer size.
      
      Please see [1] for context where the problem of stuck filesystem was hit
      for real (because kernel client was incorrectly sending more than
      max_write data with NOTIFY_REPLY; see also previous patch), how the
      situation was traced and for more involving patch that did not make it
      into the tree.
      
      [1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
      Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
      Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      1fb027d7
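The buffer-size contract described in this commit can be sketched as simple arithmetic. The 80-byte header room is taken from the commit message; `HEADER_ROOM`, `min_read_buffer()` and `read_buffer_ok()` are hypothetical names for illustration, and the sketch ignores the separate FUSE_MIN_READ_BUFFER floor that the message also mentions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Header room quoted in the commit message:
 * sizeof(fuse_in_header) + sizeof(fuse_write_in) = 80 bytes.
 * Hard-coded here so the sketch builds without <linux/fuse.h>.
 */
#define HEADER_ROOM ((size_t)80)

/* Smallest read buffer that honours the contract for a given max_write. */
static size_t min_read_buffer(size_t max_write)
{
    return HEADER_ROOM + max_write;
}

/* Hypothetical check: would a read with this buffer size be acceptable? */
static bool read_buffer_ok(size_t buf_size, size_t max_write)
{
    assert(max_write > 0);
    return buf_size >= min_read_buffer(max_write);
}
```

For a negotiated max_write of 128K (131072 bytes), this gives a minimum buffer of 131152 bytes; a server queuing anything smaller would be the case the commit wants reported back as an error.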
  7. 11 Jun, 2019 1 commit
  8. 24 Apr, 2019 3 commits
    • K
      fuse: require /dev/fuse reads to have enough buffer capacity · d4b13963
      Kirill Smelkov authored
      A FUSE filesystem server queues /dev/fuse sys_read calls to get
      filesystem requests to handle. It does not know in advance what would be
      that request as it can be anything that client issues - LOOKUP, READ,
      WRITE, ... Many requests are short and retrieve data from the
      filesystem. However WRITE and NOTIFY_REPLY write data into filesystem.
      
      Before getting into operation phase, FUSE filesystem server and kernel
      client negotiate what should be the maximum write size the client will
      ever issue. After negotiation the contract in between server/client is
      that the filesystem server then should queue /dev/fuse sys_read calls with
      enough buffer capacity to receive any client request - WRITE in
      particular, while FUSE client should not, in particular, send WRITE
      requests with > negotiated max_write payload. FUSE client in kernel and
      libfuse historically reserve 4K for request header. This way the
      contract is that filesystem server should queue sys_reads with
      4K+max_write buffer.
      
      If the filesystem server does not follow this contract, what can happen
      is that fuse_dev_do_read will see that request size is > buffer size,
      and then it will return EIO to client who issued the request but won't
      indicate in any way that there is a problem to filesystem server.
      This can be hard to diagnose because for some requests, e.g. for
      NOTIFY_REPLY which mimics WRITE, there is no client thread that is
      waiting for request completion and that EIO goes nowhere, while on
      filesystem server side things look like the kernel is not replying back
      after successful NOTIFY_RETRIEVE request made by the server.
      
      We can make the problem easy to diagnose if we indicate via error return to
      filesystem server when it is violating the contract.  This should not
      practically cause problems because if a filesystem server is using shorter
      buffer, writes to it were already very likely to cause EIO, and if the
      filesystem is read-only it should be too following FUSE_MIN_READ_BUFFER
      minimum buffer size.
      
      Please see [1] for context where the problem of stuck filesystem was hit
      for real (because kernel client was incorrectly sending more than
      max_write data with NOTIFY_REPLY; see also previous patch), how the
      situation was traced and for more involving patch that did not make it
      into the tree.
      
      [1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
      Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      d4b13963
    • K
      fuse: retrieve: cap requested size to negotiated max_write · 7640682e
      Kirill Smelkov authored
      FUSE filesystem server and kernel client negotiate during initialization
      phase, what should be the maximum write size the client will ever issue.
      Correspondingly the filesystem server then queues sys_read calls to read
      requests with buffer capacity large enough to carry request header + that
      max_write bytes. A filesystem server is free to set its max_write
      anywhere in the range [1*page, fc->max_pages*page]. In particular
      go-fuse[2] sets max_write to 64K by default, whereas the default
      fc->max_pages corresponds to 128K. Libfuse also allows users to configure
      max_write, but by default presets it to the possible maximum.
      
      If max_write is < fc->max_pages*page, and in NOTIFY_RETRIEVE handler we
      allow to retrieve more than max_write bytes, corresponding prepared
      NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the
      filesystem server, in full correspondence with server/client contract, will
      be only queuing sys_read with ~max_write buffer capacity, and
      fuse_dev_do_read throws away requests that cannot fit into server request
      buffer. In turn the filesystem server could get stuck waiting indefinitely
      for NOTIFY_REPLY since NOTIFY_RETRIEVE handler returned OK which is
      understood by clients as that NOTIFY_REPLY was queued and will be sent
      back.
      
      Cap requested size to negotiated max_write to avoid the problem.  This
      aligns with the way NOTIFY_RETRIEVE handler works, which already
      unconditionally caps requested retrieve size to fuse_conn->max_pages.  This
      way it should not hurt NOTIFY_RETRIEVE semantic if we return less data than
      was originally requested.
      
      Please see [1] for context where the problem of stuck filesystem was hit
      for real, how the situation was traced and for more involving patch that
      did not make it into the tree.
      
      [1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
      [2] https://github.com/hanwen/go-fuse
      Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      7640682e
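The capping rule in this commit amounts to taking the smallest of three values. A minimal sketch, assuming byte counts and a hypothetical helper `cap_retrieve_size()` (not the kernel's function):

```c
#include <assert.h>
#include <stddef.h>

/* Smaller of two sizes. */
static size_t min_size(size_t a, size_t b)
{
    return a < b ? a : b;
}

/*
 * Hypothetical helper: cap a NOTIFY_RETRIEVE request first to the
 * max_pages limit and then to the negotiated max_write.
 */
static size_t cap_retrieve_size(size_t requested, size_t max_write,
                                size_t max_pages, size_t page_size)
{
    assert(page_size > 0);
    size_t capped = min_size(requested, max_pages * page_size);
    return min_size(capped, max_write);
}
```

With the numbers from the message (max_write of 64K, max_pages corresponding to 128K), a 128K retrieve request is reduced to 64K, so the resulting NOTIFY_REPLY fits the buffer the server has queued.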
    • K
      fuse: convert printk -> pr_* · f2294482
      Kirill Smelkov authored
      Functions, like pr_err, are a more modern variant of printing compared to
      printk. They could be used to denoise sources by using needed level in
      the print function name, and by automatically inserting per-driver /
      function / ... print prefix as defined by pr_fmt macro. pr_* are also
      said to be used in Documentation/process/coding-style.rst and more
      recent code - for example overlayfs - uses them instead of printk.
      
      Convert CUSE and FUSE to use the new pr_* functions.
      
      CUSE output stays completely unchanged, while FUSE output is amended a
      bit for "trying to steal weird page" warning - the second line now comes
      also with "fuse:" prefix. I hope it is ok.
      Suggested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      f2294482
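The prefixing effect of pr_fmt can be imitated in userspace to show why the second warning line gains a "fuse:" prefix. This is a sketch only: `model_pr_err()` is a hypothetical stand-in that formats into a buffer, it omits the log level that the kernel's pr_err() adds, and it relies on the GNU `##__VA_ARGS__` extension.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Every format string passed through pr_fmt() gains this prefix. */
#define pr_fmt(fmt) "fuse: " fmt

/*
 * Userspace stand-in for a pr_* call: formats into a caller-supplied
 * buffer so the result can be inspected. Not the kernel macro.
 */
#define model_pr_err(buf, size, fmt, ...) \
    snprintf((buf), (size), pr_fmt(fmt), ##__VA_ARGS__)
```

Calling `model_pr_err(buf, sizeof(buf), "bad page %d\n", 7)` leaves `"fuse: bad page 7\n"` in `buf`, since string-literal concatenation attaches the prefix at compile time.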
  9. 15 Apr, 2019 1 commit
  10. 13 Feb, 2019 7 commits