  1. 10 Jun, 2019 6 commits
  2. 06 May, 2019 1 commit
    • vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files · 438ab720
      Committed by Kirill Smelkov
      This amends commit 10dce8af ("fs: stream_open - opener for
      stream-like files so that read and write can run simultaneously without
      deadlock") in how the position is passed into the .read()/.write()
      handlers of stream-like files:
      
      Rasmus noticed that we currently pass 0 as the position and ignore any
      position change made by a file implementation. This papers over bugs: if
      ppos is used in a file that declares itself stream-like, the misuse goes
      unnoticed. Even if a file implementation is correctly converted to
      stream_open, its read/write could later be changed to use ppos, and
      although that would not work correctly, the bug might go unnoticed until
      someone analyses the wrong behaviour. It is thus better to pass ppos=NULL
      into read/write for stream-like files: that leaves no room for silent ppos
      usage bugs, because the kernel will oops if ppos is ever used inside
      .read() or .write().
      
      Note 1: rw_verify_area and new_sync_{read,write} need to be updated
      because they are called by vfs_read/vfs_write & friends before
      file_operations .read/.write.
      
      Note 2: if the file backend uses new-style .read_iter/.write_iter, the
      position is still passed there as the non-pointer kiocb.ki_pos. Currently
      stream_open.cocci (the semantic patch added by 10dce8af) ignores files
      whose file_operations has *_iter methods.
      Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
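As a rough illustration of the change described in this commit, the userspace sketch below models how a syscall wrapper could hand the handler either a pointer to a local copy of the position or NULL. The names `file_model`, `MODEL_FMODE_STREAM`, `pick_ppos` and `sys_read_model` are hypothetical stand-ins for this sketch, not the kernel's own definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag value, chosen for this sketch only. */
#define MODEL_FMODE_STREAM 0x1u

typedef long long pos_t;

struct file_model {
	unsigned int f_mode;
	pos_t f_pos;
	/* Handler in the style of file_operations.read: ppos may be NULL. */
	long (*read)(struct file_model *f, size_t count, pos_t *ppos);
};

/* Stream-like files get no position pointer at all. */
static pos_t *pick_ppos(struct file_model *f)
{
	return (f->f_mode & MODEL_FMODE_STREAM) ? NULL : &f->f_pos;
}

/* Wrapper: works on a local copy of the position, writes it back on success. */
static long sys_read_model(struct file_model *f, size_t count)
{
	pos_t pos = 0, *ppos = pick_ppos(f);
	long ret;

	if (ppos) {
		pos = *ppos;
		ppos = &pos;
	}
	ret = f->read(f, count, ppos);
	if (ret >= 0 && ppos)
		f->f_pos = pos;
	return ret;
}

/* A positional handler: advances *ppos, so it must never be handed NULL. */
static long seekable_read(struct file_model *f, size_t count, pos_t *ppos)
{
	(void)f;
	assert(ppos != NULL);
	*ppos += (pos_t)count;
	return (long)count;
}

/* A pure stream handler: never touches ppos. */
static long stream_read(struct file_model *f, size_t count, pos_t *ppos)
{
	(void)f;
	(void)ppos;
	return (long)count;
}
```

In this model a handler that dereferences ppos on a stream-like file fails immediately instead of silently reading a dummy position, which is the point the commit message makes.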
  3. 07 Apr, 2019 1 commit
    • fs: stream_open - opener for stream-like files so that read and write can run... · 10dce8af
      Committed by Kirill Smelkov
      fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
      
      Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") added
      locking for file.f_pos access and in particular made concurrent read and
      write impossible: both functions now take the f_pos lock for the
      whole run, so if, e.g., a read is blocked waiting for data, a write will
      deadlock waiting for that read to complete.
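To make the failure mode concrete, here is a minimal single-threaded model, assuming a per-file flag and lock with hypothetical names (`file_model`, `atomic_pos`, `begin_io`, `end_io`). It only tracks whether a second operation would have to wait; the real kernel uses a mutex and has further conditions that are omitted here.

```c
#include <assert.h>
#include <stdbool.h>

struct file_model {
	bool atomic_pos;   /* stands in for FMODE_ATOMIC_POS */
	bool pos_locked;   /* stands in for the f_pos lock being held */
};

/*
 * Start a read or write. Returns true if the operation may proceed,
 * false if it would have to wait for the f_pos lock.
 */
static bool begin_io(struct file_model *f)
{
	if (!f->atomic_pos)
		return true;        /* no position locking for this file */
	if (f->pos_locked)
		return false;       /* another read/write holds the lock */
	f->pos_locked = true;
	return true;
}

/* Finish an operation that begin_io() allowed to proceed. */
static void end_io(struct file_model *f)
{
	if (f->atomic_pos) {
		assert(f->pos_locked);
		f->pos_locked = false;
	}
}
```

With `atomic_pos` set, a write started while a read is still between `begin_io()` and `end_io()` gets `false`; if that read is itself waiting for data that only arrives after the write, neither can make progress.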
      
      This caused a regression for stream-like files, where read and
      write could previously run simultaneously but no longer could after
      that patch. See e.g. commit 581d21a2 ("xenbus: fix deadlock on writes
      to /proc/xen/xenbus"), which fixes this regression for the particular case of
      /proc/xen/xenbus.
      
      The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
      safety for read/write/lseek and added the locking to file descriptors of
      all regular files. In 2014 that thread-safety problem was not new as it
      was already discussed earlier in 2006.
      
      However, even though the 2006 version of Linus's patch added f_pos
      locking "only for files that are marked seekable with FMODE_LSEEK (thus
      avoiding the stream-like objects like pipes and sockets)", the 2014
      version - the one that actually made it into the tree as 9c225f26 -
      does so regardless of whether a file is seekable or not.
      
      See
      
          https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
          https://lwn.net/Articles/180387
          https://lwn.net/Articles/180396
      
      for historic context.
      
      The reason it did so is, probably, that there are many files that
      are marked non-seekable, but whose read implementation, for example,
      depends on knowing the current position to handle the read correctly. Some
      examples:
      
      	kernel/power/user.c		snapshot_read
      	fs/debugfs/file.c		u32_array_read
      	fs/fuse/control.c		fuse_conn_waiting_read + ...
      	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
      	arch/s390/hypfs/inode.c		hypfs_read_iter
      	...
      
      Despite that, many nonseekable_open users implement read and write with
      pure stream semantics - they don't depend on the passed ppos at all. And in
      those cases where read can wait for something inside, this creates a
      situation similar to xenbus: the write can never proceed until the
      read is done, and the read is waiting for some, potentially external, event
      for a potentially unbounded time -> deadlock.
      
      Besides xenbus, there are 14 such places in the kernel that I've found
      with a semantic patch (see below):
      
      	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
      	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
      	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
      	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
      	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
      	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
      	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
      	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
      	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
      	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
      	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
      	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
      	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
      	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
      
      In addition to the cases above, another regression caused by f_pos
      locking is that FUSE filesystems that implement open with the
      FOPEN_NONSEEKABLE flag can no longer implement bidirectional
      stream-like files, for the same reason as above: a read can deadlock a
      write on the file.f_pos lock in the kernel.
      
      FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990 ("fuse:
      implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
      in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
      write routines not depending on current position at all, and with both
      read and write being potentially blocking operations:
      
      See
      
          https://github.com/libfuse/osspd
          https://lwn.net/Articles/308445
      
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
      
      Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
      "somewhat pipe-like files ..." with read handler not using offset.
      However that test implements only read without write and cannot exercise
      the deadlock scenario:
      
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
      
      I've actually hit the read vs write deadlock for real while implementing
      my FUSE filesystem where there is /head/watch file, for which open
      creates separate bidirectional socket-like stream in between filesystem
      and its user with both read and write being later performed
      simultaneously. And there it is semantically not easy to split the
      stream into two separate read-only and write-only channels:
      
          https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
      
      Let's fix this regression. The plan is:
      
      1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
         doing so would break many in-kernel nonseekable_open users which
         actually use ppos in read/write handlers.
      
      2. Add stream_open() to kernel to open stream-like non-seekable file
         descriptors. Read and write on such file descriptors would never use
         nor change ppos. And with that property on stream-like files read and
         write will be running without taking f_pos lock - i.e. read and write
         could be running simultaneously.
      
      3. Using a semantic patch, find and convert to stream_open all in-kernel
         nonseekable_open users whose read and write do not actually
         depend on ppos and whose file_operations has no other methods
         that assume @offset access.
      
      4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file descriptors via
         stream_open if that bit is present in the filesystem's open reply.
      
         It was tempting to change fs/fuse/ open handler to use stream_open
         instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
         grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
         and in particular GVFS which actually uses offset in its read and
         write handlers
      
      	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
      
         so such a change would break a real user.
      
      5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
         from v3.14+ (the kernel where 9c225f26 first appeared).
      
         This will allow patching OSSPD and other FUSE filesystems that
         provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
         in their open handler and thereby avoid the deadlock on all kernel
         versions. This should work because fs/fuse/ ignores unknown open
         flags returned from a filesystem, so passing FOPEN_STREAM to a
         kernel that is not aware of this flag cannot hurt. In turn, a kernel
         that is not aware of FOPEN_STREAM will be < v3.14, where just
         FOPEN_NONSEEKABLE is sufficient to implement streams without the read vs
         write deadlock.
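The plan above can be sketched as a small userspace model of the mode bits involved. The bit values and the `*_model` function names are assumptions made for illustration; only the relationships between the flags follow the plan, and the dispatch order in `fuse_finish_open_model` (stream bit checked first) is likewise an assumption of this sketch.

```c
#include <assert.h>

/* Illustrative bit values for this sketch; the kernel's own values differ. */
#define M_LSEEK       0x01u
#define M_PREAD       0x02u
#define M_PWRITE      0x04u
#define M_ATOMIC_POS  0x08u
#define M_STREAM      0x10u

#define R_NONSEEKABLE 0x1u  /* stands in for FOPEN_NONSEEKABLE */
#define R_STREAM      0x2u  /* stands in for FOPEN_STREAM */

/* Non-seekable open: no lseek/pread/pwrite, f_pos locking is kept (item 1). */
static unsigned int nonseekable_open_model(unsigned int f_mode)
{
	return f_mode & ~(M_LSEEK | M_PREAD | M_PWRITE);
}

/* Stream open: also drops f_pos locking and marks the file a stream (item 2). */
static unsigned int stream_open_model(unsigned int f_mode)
{
	f_mode &= ~(M_LSEEK | M_PREAD | M_PWRITE | M_ATOMIC_POS);
	return f_mode | M_STREAM;
}

/* FUSE open reply handling on a kernel that knows the stream bit (item 4). */
static unsigned int fuse_finish_open_model(unsigned int f_mode,
					   unsigned int reply_flags)
{
	if (reply_flags & R_STREAM)
		return stream_open_model(f_mode);
	if (reply_flags & R_NONSEEKABLE)
		return nonseekable_open_model(f_mode);
	return f_mode;
}

/* A kernel unaware of the stream bit simply ignores it (item 5). */
static unsigned int old_fuse_finish_open_model(unsigned int f_mode,
					       unsigned int reply_flags)
{
	if (reply_flags & R_NONSEEKABLE)
		return nonseekable_open_model(f_mode);
	return f_mode;
}

/* Sanity check: a stream-opened file never needs the f_pos lock. */
static void check_stream_has_no_pos_lock(unsigned int f_mode)
{
	assert(!(stream_open_model(f_mode) & M_ATOMIC_POS));
}
```

A filesystem replying with both bits set gets stream semantics where the kernel understands them and plain non-seekable semantics elsewhere, which is why returning both flags is described as safe.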
      
      This patch adds stream_open, converts /proc/xen/xenbus to it and adds
      semantic patch to automatically locate in-kernel places that are either
      required to be converted due to read vs write deadlock, or that are just
      safe to be converted because read and write do not use ppos and there
      are no other funky methods in file_operations.
      
      Regarding the semantic patch: I've manually verified each generated change
      (that it is correct to convert) and each remaining nonseekable_open
      instance (that it is either not correct to convert, or that it is not
      converted due to current stream_open.cocci limitations).
      
      The script also does not convert files that should be valid to convert
      but that currently have .llseek = noop_llseek or generic_file_llseek for
      an unknown reason, despite the file being opened with nonseekable_open (e.g.
      drivers/input/mousedev.c).
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yongzhi Pan <panyongzhi@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Nikolaus Rath <Nikolaus@rath.org>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 05 Mar, 2019 1 commit
    • get rid of legacy 'get_ds()' function · 736706be
      Committed by Linus Torvalds
      Every in-kernel use of this function defined it to KERNEL_DS (either as
      an actual define, or as an inline function).  It's an entirely
      historical artifact, and long long long ago used to actually read the
      segment selector value of '%ds' on x86.
      
      Which in the kernel is always KERNEL_DS.
      
      Inspired by a patch from Jann Horn that just did this for a very small
      subset of users (the ones in fs/), along with Al who suggested a script.
      I then just took it to the logical extreme and removed all the remaining
      gunk.
      
      Roughly scripted with
      
         git grep -l '(get_ds())' -- :^tools/ | xargs sed -i 's/(get_ds())/(KERNEL_DS)/'
         git grep -lw 'get_ds' -- :^tools/ | xargs sed -i '/^#define get_ds()/d'
      
      plus manual fixups to remove a few unusual usage patterns, the couple of
      inline function cases and to fix up a comment that had become stale.
      
      The 'get_ds()' function remains in an x86 kvm selftest, since in user
      space it actually does something relevant.
      Inspired-by: Jann Horn <jannh@google.com>
      Inspired-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 22 Feb, 2019 1 commit
  6. 16 Feb, 2019 1 commit
    • vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1 · cc4b1242
      Committed by Aurelien Jarno
      The preadv2 and pwritev2 syscalls are supposed to emulate the readv and
      writev syscalls when offset == -1. Therefore the compat code should
      check the offset before calling do_compat_preadv64 and
      do_compat_pwritev64. This is already done for the preadv2 and pwritev2
      syscalls, but handling of offset == -1 is missing in their 64-bit
      equivalents.
      
      This patch fixes that, calling do_compat_readv and do_compat_writev when
      offset == -1. This fixes the following glibc tests on x32:
       - misc/tst-preadvwritev2
       - misc/tst-preadvwritev64v2
      
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: H.J. Lu <hjl.tools@gmail.com>
      Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
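The offset == -1 convention can be observed from userspace. The sketch below assumes a Linux system whose C library provides the `preadv2()` wrapper (glibc does when `_GNU_SOURCE` is defined); the helper names `read_at` and `pipe_offset_demo` are made up for this example. It uses a pipe because a pipe has no file position to seek to, so only the readv-like form can succeed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Read up to len bytes with preadv2(); offset == -1 means "behave like readv". */
static ssize_t read_at(int fd, char *buf, size_t len, off_t offset)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	assert(buf != NULL);
	return preadv2(fd, &iov, 1, offset, 0);
}

/*
 * Returns 1 if a pipe accepts offset == -1 (readv-like behaviour) and
 * rejects an explicit offset with ESPIPE, 0 otherwise.
 */
static int pipe_offset_demo(void)
{
	char buf[4] = { 0 };
	int fds[2];
	int ok = 0;

	if (pipe(fds) != 0)
		return 0;
	if (write(fds[1], "abcd", 4) == 4) {
		ssize_t n = read_at(fds[0], buf, 2, -1);
		int stream_ok = (n == 2 && memcmp(buf, "ab", 2) == 0);

		errno = 0;
		n = read_at(fds[0], buf, 2, 0);
		ok = stream_ok && n == -1 && errno == ESPIPE;
	}
	close(fds[0]);
	close(fds[1]);
	return ok;
}
```

The bug fixed by this commit sat in the kernel's compat entry points, which this example does not exercise directly; it only shows the semantics those entry points are expected to preserve.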
  7. 04 Jan, 2019 1 commit
    • Remove 'type' argument from access_ok() function · 96d4f267
      Committed by Linus Torvalds
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted in this patch, because there is no way we're going to
      move the old VERIFY_xyz interface to that model.  And it's best done at
      the end of the merge window when I've done most of my merges, so let's
      just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 05 Dec, 2018 1 commit
  9. 22 Nov, 2018 1 commit
  10. 30 Oct, 2018 17 commits
  11. 18 Oct, 2018 3 commits
  12. 24 Sep, 2018 1 commit
    • vfs: swap names of {do,vfs}_clone_file_range() · a725356b
      Committed by Amir Goldstein
      Commit 031a072a ("vfs: call vfs_clone_file_range() under freeze
      protection") created a wrapper do_clone_file_range() around
      vfs_clone_file_range(), moving the freeze protection to the former so
      that overlayfs could call the latter.
      
      The more common vfs practice is to call do_xxx helpers from vfs_xxx
      helpers, where freeze protection is taken in the vfs_xxx helper, so
      this anomaly could be a source of confusion.
      
      It seems that commit 8ede2055 ("ovl: add reflink/copyfile/dedup
      support") may have fallen victim to this confusion:
      ovl_clone_file_range() calls the vfs_clone_file_range() helper in the
      hope of getting freeze protection on the upper fs, but in fact this lets
      overlayfs bypass upper fs freeze protection.
      
      Swap the names of the two helpers to conform to common vfs practice
      and call the correct helpers from overlayfs and nfsd.
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
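The naming convention this commit restores can be modelled in a few lines. This is a sketch with hypothetical names (`sb_model`, `do_clone_model`, `vfs_clone_model`); the counters merely stand in for the kernel's freeze protection and do not reproduce it.

```c
#include <assert.h>

/* Hypothetical model of a superblock's freeze/write accounting. */
struct sb_model {
	int writers;      /* operations currently holding freeze protection */
	int unprotected;  /* times the operation ran without protection */
	int done;         /* times the operation ran at all */
};

/* do_xxx-style helper: performs the work, takes no freeze protection itself. */
static void do_clone_model(struct sb_model *sb)
{
	if (sb->writers == 0)
		sb->unprotected++;
	sb->done++;
}

/* vfs_xxx-style helper: takes freeze protection around the do_xxx helper. */
static void vfs_clone_model(struct sb_model *sb)
{
	sb->writers++;    /* stands in for taking freeze protection */
	do_clone_model(sb);
	sb->writers--;    /* stands in for dropping it again */
	assert(sb->writers >= 0);
}
```

A caller that wants protection calls the `vfs_` helper; a caller that already holds it, or deliberately manages it itself, calls the `do_` helper. Before the rename the two names had the opposite meaning, which is how a caller could end up unprotected by accident.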
  13. 29 Aug, 2018 1 commit
    • asm-generic: Remove unneeded __ARCH_WANT_SYS_LLSEEK macro · caf6f9c8
      Committed by Arnd Bergmann
      The sys_llseek system call is needed on all 32-bit architectures and
      none of the 64-bit ones, so we can remove the __ARCH_WANT_SYS_LLSEEK guard
      and simplify the include/asm-generic/unistd.h header further.
      
      Since 32-bit tasks can run either natively or in compat mode on 64-bit
      architectures, we have to check for both !CONFIG_64BIT and CONFIG_COMPAT.
      
      There are a few 64-bit architectures that also reference sys_llseek
      in their 64-bit ABI (e.g. sparc), but I verified that those all
      select CONFIG_COMPAT, so the #if check is still correct here. It's
      a bit odd to include it in the syscall table though, as it's the
      same as sys_lseek() on 64-bit, but with strange calling conventions.
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
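For readers unfamiliar with the "strange calling conventions" mentioned in this commit, the sketch below shows the general idea: a 64-bit offset is passed as two 32-bit halves and reassembled by the callee. The `MODEL_*` macros are hypothetical stand-ins for the configuration symbols, and `llseek_offset` is an illustrative helper rather than kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel configuration; edit to experiment. */
#define MODEL_64BIT  1
#define MODEL_COMPAT 1

/* The guard described above: needed on 32-bit kernels and for compat tasks. */
#if !MODEL_64BIT || MODEL_COMPAT
#define MODEL_HAVE_LLSEEK 1
#else
#define MODEL_HAVE_LLSEEK 0
#endif

/* Reassemble a 64-bit offset from the two halves a 32-bit caller passes. */
static int64_t llseek_offset(uint32_t offset_high, uint32_t offset_low)
{
	return (int64_t)(((uint64_t)offset_high << 32) | offset_low);
}

/* Splitting an offset into halves and recombining it must round-trip. */
static void check_round_trip(int64_t off)
{
	uint64_t u = (uint64_t)off;

	assert(llseek_offset((uint32_t)(u >> 32), (uint32_t)u) == off);
}
```

On a 64-bit ABI a single register already holds the whole offset, which is why the commit message calls it odd to expose this entry point there.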
  14. 18 Jul, 2018 1 commit
  15. 07 Jul, 2018 3 commits