1. 26 4月, 2019 10 次提交
  2. 20 4月, 2019 1 次提交
    • A
      coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping · 04f5866e
      Andrea Arcangeli 提交于
      The core dumping code has always run without holding the mmap_sem for
      writing, despite that is the only way to ensure that the entire vma
      layout will not change from under it.  Only using some signal
      serialization on the processes belonging to the mm is not nearly enough.
      This was pointed out earlier.  For example in Hugh's post from Jul 2017:
      
        https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils
      
        "Not strictly relevant here, but a related note: I was very surprised
         to discover, only quite recently, how handle_mm_fault() may be called
         without down_read(mmap_sem) - when core dumping. That seems a
         misguided optimization to me, which would also be nice to correct"
      
      In particular because the growsdown and growsup can move the
      vm_start/vm_end the various loops the core dump does around the vma will
      not be consistent if page faults can happen concurrently.
      
      Pretty much all users calling mmget_not_zero()/get_task_mm() and then
      taking the mmap_sem had the potential to introduce unexpected side
      effects in the core dumping code.
      
      Adding mmap_sem for writing around the ->core_dump invocation is a
      viable long term fix, but it requires removing all copy user and page
      faults and to replace them with get_dump_page() for all binary formats
      which is not suitable as a short term fix.
      
      For the time being this solution manually covers the places that can
      confuse the core dump either by altering the vma layout or the vma flags
      while it runs.  Once ->core_dump runs under mmap_sem for writing the
      function mmget_still_valid() can be dropped.
      
      Allowing mmap_sem protected sections to run in parallel with the
      coredump provides some minor parallelism advantage to the swapoff code
      (which seems to be safe enough by never mangling any vma field and can
      keep doing swapins in parallel to the core dumping) and to some other
      corner case.
      
      In order to facilitate the backporting I added "Fixes: 86039bd3"
      however the side effect of this same race condition in /proc/pid/mem
      should be reproducible since before 2.6.12-rc2 so I couldn't add any
      other "Fixes:" because there's no hash beyond the git genesis commit.
      
      Because find_extend_vma() is the only location outside of the process
      context that could modify the "mm" structures under mmap_sem for
      reading, by adding the mmget_still_valid() check to it, all other cases
      that take the mmap_sem for reading don't need the new check after
      mmget_not_zero()/get_task_mm().  The expand_stack() in page fault
      context also doesn't need the new check, because all tasks under core
      dumping are frozen.
      
      Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
      Fixes: 86039bd3 ("userfaultfd: add new syscall to provide memory externalization")
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NJann Horn <jannh@google.com>
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NPeter Xu <peterx@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NJann Horn <jannh@google.com>
      Acked-by: NJason Gunthorpe <jgg@mellanox.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04f5866e
  3. 18 4月, 2019 1 次提交
    • J
      io_uring: fix CQ overflow condition · 74f464e9
      Jens Axboe 提交于
      This is a leftover from when the rings initially were not free flowing,
      and hence a test for tail + 1 == head would indicate full. Since we now
      let them wrap instead of mask them with the size, we need to check if
      they drift more than the ring size from each other.
      
      This fixes a case where we'd overwrite CQ ring entries, if the user
      failed to reap completions. Both cases would ultimately result in lost
      completions as the application violated the depth it asked for. The only
      difference is that before this fix we'd return invalid entries for the
      overflowed completions, instead of properly flagging it in the
      cq_ring->overflow variable.
      Reported-by: NStefan Bühler <source@stbuehler.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      74f464e9
  4. 16 4月, 2019 6 次提交
    • A
      CIFS: keep FileInfo handle live during oplock break · b98749ca
      Aurelien Aptel 提交于
      In the oplock break handler, writing pending changes from pages puts
      the FileInfo handle. If the refcount reaches zero it closes the handle
      and waits for any oplock break handler to return, thus causing a deadlock.
      
      To prevent this situation:
      
      * We add a wait flag to cifsFileInfo_put() to decide whether we should
        wait for running/pending oplock break handlers
      
      * We keep an additionnal reference of the SMB FileInfo handle so that
        for the rest of the handler putting the handle won't close it.
        - The ref is bumped everytime we queue the handler via the
          cifs_queue_oplock_break() helper.
        - The ref is decremented at the end of the handler
      
      This bug was triggered by xfstest 464.
      
      Also important fix to address the various reports of
      oops in smb2_push_mandatory_locks
      Signed-off-by: NAurelien Aptel <aaptel@suse.com>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      Reviewed-by: NPavel Shilovsky <pshilov@microsoft.com>
      CC: Stable <stable@vger.kernel.org>
      b98749ca
    • R
      cifs: fix handle leak in smb2_query_symlink() · e6d0fb7b
      Ronnie Sahlberg 提交于
      If we enter smb2_query_symlink() for something that is not a symlink
      and where the SMB2_open() would succeed we would never end up
      closing this handle and would thus leak a handle on the server.
      
      Fix this by immediately calling SMB2_close() on successfull open.
      Signed-off-by: NRonnie Sahlberg <lsahlber@redhat.com>
      CC: Stable <stable@vger.kernel.org>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      Reviewed-by: NPavel Shilovsky <pshilov@microsoft.com>
      e6d0fb7b
    • Z
      cifs: Fix lease buffer length error · b57a55e2
      ZhangXiaoxu 提交于
      There is a KASAN slab-out-of-bounds:
      BUG: KASAN: slab-out-of-bounds in _copy_from_iter_full+0x783/0xaa0
      Read of size 80 at addr ffff88810c35e180 by task mount.cifs/539
      
      CPU: 1 PID: 539 Comm: mount.cifs Not tainted 4.19 #10
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
                  rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
      Call Trace:
       dump_stack+0xdd/0x12a
       print_address_description+0xa7/0x540
       kasan_report+0x1ff/0x550
       check_memory_region+0x2f1/0x310
       memcpy+0x2f/0x80
       _copy_from_iter_full+0x783/0xaa0
       tcp_sendmsg_locked+0x1840/0x4140
       tcp_sendmsg+0x37/0x60
       inet_sendmsg+0x18c/0x490
       sock_sendmsg+0xae/0x130
       smb_send_kvec+0x29c/0x520
       __smb_send_rqst+0x3ef/0xc60
       smb_send_rqst+0x25a/0x2e0
       compound_send_recv+0x9e8/0x2af0
       cifs_send_recv+0x24/0x30
       SMB2_open+0x35e/0x1620
       open_shroot+0x27b/0x490
       smb2_open_op_close+0x4e1/0x590
       smb2_query_path_info+0x2ac/0x650
       cifs_get_inode_info+0x1058/0x28f0
       cifs_root_iget+0x3bb/0xf80
       cifs_smb3_do_mount+0xe00/0x14c0
       cifs_do_mount+0x15/0x20
       mount_fs+0x5e/0x290
       vfs_kern_mount+0x88/0x460
       do_mount+0x398/0x31e0
       ksys_mount+0xc6/0x150
       __x64_sys_mount+0xea/0x190
       do_syscall_64+0x122/0x590
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      It can be reproduced by the following step:
        1. samba configured with: server max protocol = SMB2_10
        2. mount -o vers=default
      
      When parse the mount version parameter, the 'ops' and 'vals'
      was setted to smb30,  if negotiate result is smb21, just
      update the 'ops' to smb21, but the 'vals' is still smb30.
      When add lease context, the iov_base is allocated with smb21
      ops, but the iov_len is initiallited with the smb30. Because
      the iov_len is longer than iov_base, when send the message,
      copy array out of bounds.
      
      we need to keep the 'ops' and 'vals' consistent.
      
      Fixes: 9764c02f ("SMB3: Add support for multidialect negotiate (SMB2.1 and later)")
      Fixes: d5c7076b ("smb3: add smb3.1.1 to default dialect list")
      Signed-off-by: NZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org>
      Reviewed-by: NPavel Shilovsky <pshilov@microsoft.com>
      b57a55e2
    • Z
      cifs: Fix use-after-free in SMB2_read · 088aaf17
      ZhangXiaoxu 提交于
      There is a KASAN use-after-free:
      BUG: KASAN: use-after-free in SMB2_read+0x1136/0x1190
      Read of size 8 at addr ffff8880b4e45e50 by task ln/1009
      
      Should not release the 'req' because it will use in the trace.
      
      Fixes: eccb4422 ("smb3: Add ftrace tracepoints for improved SMB3 debugging")
      Signed-off-by: NZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org> 4.18+
      Reviewed-by: NPavel Shilovsky <pshilov@microsoft.com>
      088aaf17
    • Z
      cifs: Fix use-after-free in SMB2_write · 6a3eb336
      ZhangXiaoxu 提交于
      There is a KASAN use-after-free:
      BUG: KASAN: use-after-free in SMB2_write+0x1342/0x1580
      Read of size 8 at addr ffff8880b6a8e450 by task ln/4196
      
      Should not release the 'req' because it will use in the trace.
      
      Fixes: eccb4422 ("smb3: Add ftrace tracepoints for improved SMB3 debugging")
      Signed-off-by: NZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org> 4.18+
      Reviewed-by: NPavel Shilovsky <pshilov@microsoft.com>
      6a3eb336
    • J
      io_uring: fix possible deadlock between io_uring_{enter,register} · b19062a5
      Jens Axboe 提交于
      If we have multiple threads, one doing io_uring_enter() while the other
      is doing io_uring_register(), we can run into a deadlock between the
      two. io_uring_register() must wait for existing users of the io_uring
      instance to exit. But it does so while holding the io_uring mutex.
      Callers of io_uring_enter() may need this mutex to make progress (and
      eventually exit). If we wait for users to exit in io_uring_register(),
      we can't do so with the io_uring mutex held without potentially risking
      a deadlock.
      
      Drop the io_uring mutex while waiting for existing callers to exit. This
      is safe and guaranteed to make forward progress, since we already killed
      the percpu ref before doing so. Hence later callers of io_uring_enter()
      will be rejected.
      
      Reported-by: syzbot+16dc03452dee970a0c3e@syzkaller.appspotmail.com
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b19062a5
  5. 15 4月, 2019 1 次提交
  6. 14 4月, 2019 3 次提交
    • J
      io_uring: drop io_file_put() 'file' argument · 3d6770fb
      Jens Axboe 提交于
      Since the fget/fput handling was reworked in commit 09bb8394, we
      never call io_file_put() with state == NULL (and hence file != NULL)
      anymore. Remove that case.
      Reported-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3d6770fb
    • J
      io_uring: only test SQPOLL cpu after we've verified it · 917257da
      Jens Axboe 提交于
      We currently call cpu_possible() even if we don't use the CPU. Move the
      test under the SQ_AFF branch, which is the only place where we'll use
      the value. Do the cpu_possible() test AFTER we've limited it to a max
      of NR_CPUS. This avoids triggering the following warning:
      
      WARNING: CPU: 1 PID: 7600 at include/linux/cpumask.h:121 cpu_max_bits_warn
      
      if CONFIG_DEBUG_PER_CPU_MAPS is enabled.
      
      While in there, also move the SQ thread idle period assignment inside
      SETUP_SQPOLL, as we don't use it otherwise either.
      
      Reported-by: syzbot+cd714a07c6de2bc34293@syzkaller.appspotmail.com
      Fixes: 6c271ce2 ("io_uring: add submission polling")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      917257da
    • J
      io_uring: park SQPOLL thread if it's percpu · 06058632
      Jens Axboe 提交于
      kthread expects this, or we can throw a warning on exit:
      
      WARNING: CPU: 0 PID: 7822 at kernel/kthread.c:399
      __kthread_bind_mask+0x3b/0xc0 kernel/kthread.c:399
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 0 PID: 7822 Comm: syz-executor030 Not tainted 5.1.0-rc4-next-20190412
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x172/0x1f0 lib/dump_stack.c:113
        panic+0x2cb/0x72b kernel/panic.c:214
        __warn.cold+0x20/0x46 kernel/panic.c:576
        report_bug+0x263/0x2b0 lib/bug.c:186
        fixup_bug arch/x86/kernel/traps.c:179 [inline]
        fixup_bug arch/x86/kernel/traps.c:174 [inline]
        do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
        do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
        invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
      RIP: 0010:__kthread_bind_mask+0x3b/0xc0 kernel/kthread.c:399
      Code: 48 89 fb e8 f7 ab 24 00 4c 89 e6 48 89 df e8 ac e1 02 00 31 ff 49 89
      c4 48 89 c6 e8 7f ad 24 00 4d 85 e4 75 15 e8 d5 ab 24 00 <0f> 0b e8 ce ab
      24 00 5b 41 5c 41 5d 41 5e 5d c3 e8 c0 ab 24 00 4c
      RSP: 0018:ffff8880a89bfbb8 EFLAGS: 00010293
      RAX: ffff88808ca7a280 RBX: ffff8880a98e4380 RCX: ffffffff814bdd11
      RDX: 0000000000000000 RSI: ffffffff814bdd1b RDI: 0000000000000007
      RBP: ffff8880a89bfbd8 R08: ffff88808ca7a280 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: ffffffff87691148 R14: ffff8880a98e43a0 R15: ffffffff81c91e10
        __kthread_bind kernel/kthread.c:412 [inline]
        kthread_unpark+0x123/0x160 kernel/kthread.c:480
        kthread_stop+0xfa/0x6c0 kernel/kthread.c:556
        io_sq_thread_stop fs/io_uring.c:2057 [inline]
        io_sq_thread_stop fs/io_uring.c:2052 [inline]
        io_finish_async+0xab/0x180 fs/io_uring.c:2064
        io_ring_ctx_free fs/io_uring.c:2534 [inline]
        io_ring_ctx_wait_and_kill+0x133/0x510 fs/io_uring.c:2591
        io_uring_release+0x42/0x50 fs/io_uring.c:2599
        __fput+0x2e5/0x8d0 fs/file_table.c:278
        ____fput+0x16/0x20 fs/file_table.c:309
        task_work_run+0x14a/0x1c0 kernel/task_work.c:113
        exit_task_work include/linux/task_work.h:22 [inline]
        do_exit+0x90a/0x2fa0 kernel/exit.c:876
        do_group_exit+0x135/0x370 kernel/exit.c:980
        __do_sys_exit_group kernel/exit.c:991 [inline]
        __se_sys_exit_group kernel/exit.c:989 [inline]
        __x64_sys_exit_group+0x44/0x50 kernel/exit.c:989
        do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Reported-by: syzbot+6d4a92619eb0ad08602b@syzkaller.appspotmail.com
      Fixes: 6c271ce2 ("io_uring: add submission polling")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      06058632
  7. 13 4月, 2019 7 次提交
    • D
      afs: Fix in-progess ops to ignore server-level callback invalidation · eeba1e9c
      David Howells 提交于
      The in-kernel afs filesystem client counts the number of server-level
      callback invalidation events (CB.InitCallBackState* RPC operations) that it
      receives from the server.  This is stored in cb_s_break in various
      structures, including afs_server and afs_vnode.
      
      If an inode is examined by afs_validate(), say, the afs_server copy is
      compared, along with other break counters, to those in afs_vnode, and if
      one or more of the counters do not match, it is considered that the
      server's callback promise is broken.  At points where this happens,
      AFS_VNODE_CB_PROMISED is cleared to indicate that the status must be
      refetched from the server.
      
      afs_validate() issues an FS.FetchStatus operation to get updated metadata -
      and based on the updated data_version may invalidate the pagecache too.
      
      However, the break counters are also used to determine whether to note a
      new callback in the vnode (which would set the AFS_VNODE_CB_PROMISED flag)
      and whether to cache the permit data included in the YFSFetchStatus record
      by the server.
      
      
      The problem comes when the server sends us a CB.InitCallBackState op.  The
      first such instance doesn't cause cb_s_break to be incremented, but rather
      causes AFS_SERVER_FL_NEW to be cleared - but thereafter, say some hours
      after last use and all the volumes have been automatically unmounted and
      the server has forgotten about the client[*], this *will* likely cause an
      increment.
      
       [*] There are other circumstances too, such as the server restarting or
           needing to make space in its callback table.
      
      Note that the server won't send us a CB.InitCallBackState op until we talk
      to it again.
      
      So what happens is:
      
       (1) A mount for a new volume is attempted, a inode is created for the root
           vnode and vnode->cb_s_break and AFS_VNODE_CB_PROMISED aren't set
           immediately, as we don't have a nominated server to talk to yet - and
           we may iterate through a few to find one.
      
       (2) Before the operation happens, afs_fetch_status(), say, notes in the
           cursor (fc.cb_break) the break counter sum from the vnode, volume and
           server counters, but the server->cb_s_break is currently 0.
      
       (3) We send FS.FetchStatus to the server.  The server sends us back
           CB.InitCallBackState.  We increment server->cb_s_break.
      
       (4) Our FS.FetchStatus completes.  The reply includes a callback record.
      
       (5) xdr_decode_AFSCallBack()/xdr_decode_YFSCallBack() check to see whether
           the callback promise was broken by checking the break counter sum from
           step (2) against the current sum.
      
           This fails because of step (3), so we don't set the callback record
           and, importantly, don't set AFS_VNODE_CB_PROMISED on the vnode.
      
      This does not preclude the syscall from progressing, and we don't loop here
      rechecking the status, but rather assume it's good enough for one round
      only and will need to be rechecked next time.
      
       (6) afs_validate() it triggered on the vnode, probably called from
           d_revalidate() checking the parent directory.
      
       (7) afs_validate() notes that AFS_VNODE_CB_PROMISED isn't set, so doesn't
           update vnode->cb_s_break and assumes the vnode to be invalid.
      
       (8) afs_validate() needs to calls afs_fetch_status().  Go back to step (2)
           and repeat, every time the vnode is validated.
      
      This primarily affects volume root dir vnodes.  Everything subsequent to
      those inherit an already incremented cb_s_break upon mounting.
      
      
      The issue is that we assume that the callback record and the cached permit
      information in a reply from the server can't be trusted after getting a
      server break - but this is wrong since the server makes sure things are
      done in the right order, holding up our ops if necessary[*].
      
       [*] There is an extremely unlikely scenario where a reply from before the
           CB.InitCallBackState could get its delivery deferred till after - at
           which point we think we have a promise when we don't.  This, however,
           requires unlucky mass packet loss to one call.
      
      AFS_SERVER_FL_NEW tries to paper over the cracks for the initial mount from
      a server we've never contacted before, but this should be unnecessary.
      It's also further insulated from the problem on an initial mount by
      querying the server first with FS.GetCapabilities, which triggers the
      CB.InitCallBackState.
      
      
      Fix this by
      
       (1) Remove AFS_SERVER_FL_NEW.
      
       (2) In afs_calc_vnode_cb_break(), don't include cb_s_break in the
           calculation.
      
       (3) In afs_cb_is_broken(), don't include cb_s_break in the check.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      eeba1e9c
    • M
      afs: Unlock pages for __pagevec_release() · 21bd68f1
      Marc Dionne 提交于
      __pagevec_release() complains loudly if any page in the vector is still
      locked.  The pages need to be locked for generic_error_remove_page(), but
      that function doesn't actually unlock them.
      
      Unlock the pages afterwards.
      Signed-off-by: NMarc Dionne <marc.dionne@auristor.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NJonathan Billings <jsbillin@umich.edu>
      21bd68f1
    • D
      afs: Differentiate abort due to unmarshalling from other errors · 8022c4b9
      David Howells 提交于
      Differentiate an abort due to an unmarshalling error from an abort due to
      other errors, such as ENETUNREACH.  It doesn't make sense to set abort code
      RXGEN_*_UNMARSHAL in such a case, so use RX_USER_ABORT instead.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      8022c4b9
    • A
      afs: Avoid section confusion in CM_NAME · d2abfa86
      Andi Kleen 提交于
      __tracepoint_str cannot be const because the tracepoint_str
      section is not read-only. Remove the stray const.
      
      Cc: dhowells@redhat.com
      Cc: viro@zeniv.linux.org.uk
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      d2abfa86
    • A
      afs: avoid deprecated get_seconds() · ba25b81e
      Arnd Bergmann 提交于
      get_seconds() has a limited range on 32-bit architectures and is
      deprecated because of that. While AFS uses the same limits for
      its inode timestamps on the wire protocol, let's just use the
      simpler current_time() as we do for other file systems.
      
      This will still zero out the 'tv_nsec' field of the timestamps
      internally.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      ba25b81e
    • M
      afs: Check for rxrpc call completion in wait loop · f7f1dd31
      Marc Dionne 提交于
      Check the state of the rxrpc call backing an afs call in each iteration of
      the call wait loop in case the rxrpc call has already been terminated at
      the rxrpc layer.
      
      Interrupt the wait loop and mark the afs call as complete if the rxrpc
      layer call is complete.
      
      There were cases where rxrpc errors were not passed up to afs, which could
      result in this loop waiting forever for an afs call to transition to
      AFS_CALL_COMPLETE while the rx call was already complete.
      Signed-off-by: NMarc Dionne <marc.dionne@auristor.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f7f1dd31
    • M
      rxrpc: Make rxrpc_kernel_check_life() indicate if call completed · 4611da30
      Marc Dionne 提交于
      Make rxrpc_kernel_check_life() pass back the life counter through the
      argument list and return true if the call has not yet completed.
      Suggested-by: NMarc Dionne <marc.dionne@auristor.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4611da30
  8. 12 4月, 2019 4 次提交
  9. 09 4月, 2019 1 次提交
  10. 07 4月, 2019 1 次提交
    • K
      fs: stream_open - opener for stream-like files so that read and write can run... · 10dce8af
      Kirill Smelkov 提交于
      fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
      
      Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") added
      locking for file.f_pos access and in particular made concurrent read and
      write not possible - now both those functions take f_pos lock for the
      whole run, and so if e.g. a read is blocked waiting for data, write will
      deadlock waiting for that read to complete.
      
      This caused regression for stream-like files where previously read and
      write could run simultaneously, but after that patch could not do so
      anymore. See e.g. commit 581d21a2 ("xenbus: fix deadlock on writes
      to /proc/xen/xenbus") which fixes such regression for particular case of
      /proc/xen/xenbus.
      
      The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
      safety for read/write/lseek and added the locking to file descriptors of
      all regular files. In 2014 that thread-safety problem was not new as it
      was already discussed earlier in 2006.
      
      However even though 2006'th version of Linus's patch was adding f_pos
      locking "only for files that are marked seekable with FMODE_LSEEK (thus
      avoiding the stream-like objects like pipes and sockets)", the 2014
      version - the one that actually made it into the tree as 9c225f26 -
      is doing so irregardless of whether a file is seekable or not.
      
      See
      
          https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
          https://lwn.net/Articles/180387
          https://lwn.net/Articles/180396
      
      for historic context.
      
      The reason that it did so is, probably, that there are many files that
      are marked non-seekable, but e.g. their read implementation actually
      depends on knowing current position to correctly handle the read. Some
      examples:
      
      	kernel/power/user.c		snapshot_read
      	fs/debugfs/file.c		u32_array_read
      	fs/fuse/control.c		fuse_conn_waiting_read + ...
      	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
      	arch/s390/hypfs/inode.c		hypfs_read_iter
      	...
      
      Despite that, many nonseekable_open users implement read and write with
      pure stream semantics - they don't depend on passed ppos at all. And for
      those cases where read could wait for something inside, it creates a
      situation similar to xenbus - the write could be never made to go until
      read is done, and read is waiting for some, potentially external, event,
      for potentially unbounded time -> deadlock.
      
      Besides xenbus, there are 14 such places in the kernel that I've found
      with semantic patch (see below):
      
      	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
      	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
      	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
      	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
      	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
      	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
      	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
      	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
      	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
      	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
      	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
      	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
      	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
      	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
      
      In addition to the cases above another regression caused by f_pos
      locking is that now FUSE filesystems that implement open with
      FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
      stream-like files - for the same reason as above e.g. read can deadlock
      write locking on file.f_pos in the kernel.
      
      FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990 ("fuse:
      implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
      in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
      write routines not depending on current position at all, and with both
      read and write being potentially blocking operations:
      
      See
      
          https://github.com/libfuse/osspd
          https://lwn.net/Articles/308445
      
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
      
      Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
      "somewhat pipe-like files ..." with read handler not using offset.
      However that test implements only read without write and cannot exercise
      the deadlock scenario:
      
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
      
      I've actually hit the read vs write deadlock for real while implementing
      my FUSE filesystem where there is /head/watch file, for which open
      creates separate bidirectional socket-like stream in between filesystem
      and its user with both read and write being later performed
      simultaneously. And there it is semantically not easy to split the
      stream into two separate read-only and write-only channels:
      
          https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
      
      Let's fix this regression. The plan is:
      
      1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
         doing so would break many in-kernel nonseekable_open users which
         actually use ppos in read/write handlers.
      
      2. Add stream_open() to kernel to open stream-like non-seekable file
         descriptors. Read and write on such file descriptors would never use
         nor change ppos. And with that property on stream-like files read and
         write will be running without taking f_pos lock - i.e. read and write
         could be running simultaneously.
      
      3. With semantic patch search and convert to stream_open all in-kernel
         nonseekable_open users for which read and write actually do not
         depend on ppos and where there is no other methods in file_operations
         which assume @offset access.
      
      4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
         steam_open if that bit is present in filesystem open reply.
      
         It was tempting to change fs/fuse/ open handler to use stream_open
         instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
         grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
         and in particular GVFS which actually uses offset in its read and
         write handlers
      
      	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
      
         so if we would do such a change it will break a real user.
      
      5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
         from v3.14+ (the kernel where 9c225f26 first appeared).
      
         This will allow to patch OSSPD and other FUSE filesystems that
         provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
         in their open handler and this way avoid the deadlock on all kernel
         versions. This should work because fs/fuse/ ignores unknown open
         flags returned from a filesystem and so passing FOPEN_STREAM to a
         kernel that is not aware of this flag cannot hurt. In turn the kernel
         that is not aware of FOPEN_STREAM will be < v3.14 where just
         FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
         write deadlock.
      
      This patch adds stream_open, converts /proc/xen/xenbus to it and adds
      semantic patch to automatically locate in-kernel places that are either
      required to be converted due to read vs write deadlock, or that are just
      safe to be converted because read and write do not use ppos and there
      are no other funky methods in file_operations.
      
      Regarding semantic patch I've verified each generated change manually -
      that it is correct to convert - and each other nonseekable_open instance
      left - that it is either not correct to convert there, or that it is not
      converted due to current stream_open.cocci limitations.
      
      The script also does not convert files that should be valid to convert,
      but that currently have .llseek = noop_llseek or generic_file_llseek for
      unknown reason despite file being opened with nonseekable_open (e.g.
      drivers/input/mousedev.c)
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yongzhi Pan <panyongzhi@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Nikolaus Rath <Nikolaus@rath.org>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Signed-off-by: NKirill Smelkov <kirr@nexedi.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      10dce8af
  11. 06 4月, 2019 1 次提交
  12. 05 4月, 2019 1 次提交
  13. 04 4月, 2019 3 次提交