1. 21 December 2020, 4 commits
  2. 19 December 2020, 1 commit
  3. 18 December 2020, 2 commits
    • io_uring: close a small race gap for files cancel · dfea9fce
      Pavel Begunkov committed
      The purpose of io_uring_cancel_files() is to wait for all requests
      matching ->files to go away or be cancelled. We should first drop the
      files of a request in io_req_drop_files() and only then make it
      undiscoverable to io_uring_cancel_files().
      
      First drop, then delete from the list. It's ok to leave req->id->files
      dangling, because it is not dereferenced by the cancellation code, only
      compared against. A canceller could go to sleep, but it will then be
      woken by the subsequent wake_up() in io_req_drop_files().
      
      Fixes: 0f212204 ("io_uring: don't rely on weak ->files references")
      Cc: <stable@vger.kernel.org> # 5.5+
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix io_wqe->work_list corruption · 0020ef04
      Xiaoguang Wang committed
      The first time a req is punted to io-wq, we initialize its io_wq_work
      list to NULL and then insert the req into io_wqe->work_list. If this
      req is not inserted at the tail of io_wqe->work_list, its io_wq_work
      list will point to another req's io_wq_work. In the split-bio case,
      this req may be inserted into io_wqe->work_list repeatedly; once we
      insert it at the tail of io_wqe->work_list a second time,
      io_wq_work->list->next becomes an invalid pointer, which then results
      in many strange errors: panics, kernel soft-lockups, rcu stalls, etc.
      
      In my VM, the kernel does not have commit cc29e1bf ("block: disable
      iopoll for split bio"); the fio job below reproduces this bug reliably:
      [global]
      name=iouring-sqpoll-iopoll-1
      ioengine=io_uring
      iodepth=128
      numjobs=1
      thread
      rw=randread
      direct=1
      registerfiles=1
      hipri=1
      bs=4m
      size=100M
      runtime=120
      time_based
      group_reporting
      randrepeat=0
      
      [device]
      directory=/home/feiman.wxg/mntpoint/  # an ext4 mount point
      
      With commit cc29e1bf ("block: disable iopoll for split bio") there is
      no split-bio case for polled io, but I think we still need to fix this
      list corruption, and it should probably also go to the stable branches.
      
      To fix this corruption, when a req is inserted at the tail of
      io_wqe->work_list, initialize req->io_wq_work->list->next to NULL.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 17 December 2020, 7 commits
  5. 16 December 2020, 17 commits
  6. 13 December 2020, 2 commits
  7. 12 December 2020, 1 commit
    • proc: use untagged_addr() for pagemap_read addresses · 40d6366e
      Miles Chen committed
      When we try to visit the pagemap of a tagged userspace pointer, we find
      that the start_vaddr is not correct because of the tag.
      To fix it, we should untag the userspace pointers in pagemap_read().
      
      I tested with 5.10-rc4 and the issue remains.
      
      Explanation from Catalin in [1]:
      
       "Arguably, that's a user-space bug since tagged file offsets were never
        supported. In this case it's not even a tag at bit 56 as per the arm64
        tagged address ABI but rather down to bit 47. You could say that the
        problem is caused by the C library (malloc()) or whoever created the
        tagged vaddr and passed it to this function. It's not a kernel
        regression as we've never supported it.
      
        Now, pagemap is a special case where the offset is usually not
        generated as a classic file offset but rather derived by shifting a
        user virtual address. I guess we can make a concession for pagemap
        (only) and allow such offset with the tag at bit (56 - PAGE_SHIFT + 3)"
      
      My test code is based on [2]:
      
      A userspace pointer which has been tagged by 0xb4: 0xb400007662f541c8
      
      userspace program:
      
        uint64 OsLayer::VirtualToPhysical(void *vaddr) {
      	uint64 frame, paddr, pfnmask, pagemask;
      	int pagesize = sysconf(_SC_PAGESIZE);
      	off64_t off = ((uintptr_t)vaddr) / pagesize * 8; // off = 0xb400007662f541c8 / pagesize * 8 = 0x5a00003b317aa0
      	int fd = open(kPagemapPath, O_RDONLY);
      	...
      
      	if (lseek64(fd, off, SEEK_SET) != off || read(fd, &frame, 8) != 8) {
      		int err = errno;
      		string errtxt = ErrorString(err);
      		if (fd >= 0)
      			close(fd);
      		return 0;
      	}
        ...
        }
      
      kernel fs/proc/task_mmu.c:
      
        static ssize_t pagemap_read(struct file *file, char __user *buf,
      		size_t count, loff_t *ppos)
        {
      	...
      	src = *ppos;
      	svpfn = src / PM_ENTRY_BYTES; // svpfn == 0xb400007662f54
      	start_vaddr = svpfn << PAGE_SHIFT; // start_vaddr == 0xb400007662f54000
      	end_vaddr = mm->task_size;
      
      	/* watch out for wraparound */
      	// svpfn == 0xb400007662f54
      	// (mm->task_size >> PAGE_SHIFT) == 0x8000000
      	if (svpfn > mm->task_size >> PAGE_SHIFT) // the condition is true because of the tag 0xb4
      		start_vaddr = end_vaddr;
      
      	ret = 0;
      	while (count && (start_vaddr < end_vaddr)) { // we cannot visit correct entry because start_vaddr is set to end_vaddr
      		int len;
      		unsigned long end;
      		...
      	}
      	...
        }
      
      [1] https://lore.kernel.org/patchwork/patch/1343258/
      [2] https://github.com/stressapptest/stressapptest/blob/master/src/os.cc#L158
      
      Link: https://lkml.kernel.org/r/20201204024347.8295-1-miles.chen@mediatek.com
      Signed-off-by: Miles Chen <miles.chen@mediatek.com>
      Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
      Cc: <stable@vger.kernel.org>	[5.4-]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 11 December 2020, 6 commits
    • NFS: Disable READ_PLUS by default · 21e31401
      Anna Schumaker committed
      We've been seeing failures with xfstests generic/091 and generic/263
      when using READ_PLUS. I've made some progress on these issues, and the
      tests fail later on but still don't pass. Let's disable READ_PLUS by
      default until we can work out what is going on.
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFSv4.2: Fix 5 seconds delay when doing inter server copy · fe8eb820
      Dai Ngo committed
      Since commit b4868b44 ("NFSv4: Wait for stateid updates after
      CLOSE/OPEN_DOWNGRADE"), every inter-server copy operation suffers a
      5-second delay regardless of the size of the copy. The delay comes from
      nfs_set_open_stateid_locked, when the check by nfs_stateid_is_sequential
      fails because the seqid in both nfs4_state and nfs4_stateid is 0.
      
      Fix __nfs42_ssc_open to delay setting NFS_OPEN_STATE in nfs4_state
      until after the call to update_open_stateid, to indicate that this is
      the first open. This fix is one of two patches; the other is a fix in
      the source server to return the stateid for the COPY_NOTIFY request
      with seqid 1 instead of 0.
      
      Fixes: ce0887ac ("NFSD add nfs4 inter ssc to nfsd4_copy")
      Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFS: Fix rpcrdma_inline_fixup() crash with new LISTXATTRS operation · 1c87b851
      Chuck Lever committed
      By switching to an XFS-backed export, I am able to reproduce the
      ibcomp worker crash on my client with xfstests generic/013.
      
      For the failing LISTXATTRS operation, xdr_inline_pages() is called
      with page_len=12 and buflen=128.
      
      - When ->send_request() is called, rpcrdma_marshal_req() does not
        set up a Reply chunk because buflen is smaller than the inline
        threshold. Thus rpcrdma_convert_iovs() does not get invoked at
        all and the transport's XDRBUF_SPARSE_PAGES logic is not invoked
        on the receive buffer.
      
      - During reply processing, rpcrdma_inline_fixup() tries to copy
        received data into rq_rcv_buf->pages because page_len is positive.
        But there are no receive pages because rpcrdma_marshal_req() never
        allocated them.
      
      The result is that the ibcomp worker faults and dies. Sometimes that
      causes a visible crash, and sometimes it results in a transport hang
      without other symptoms.
      
      RPC/RDMA's XDRBUF_SPARSE_PAGES support is not entirely correct, and
      should eventually be fixed or replaced. However, my preference is
      that upper-layer operations should explicitly allocate their receive
      buffers (using GFP_KERNEL) when possible, rather than relying on
      XDRBUF_SPARSE_PAGES.
      Reported-by: Olga kornievskaia <kolga@netapp.com>
      Suggested-by: Olga kornievskaia <kolga@netapp.com>
      Fixes: c10a7514 ("NFSv4.2: add the extended attribute proc functions.")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Olga kornievskaia <kolga@netapp.com>
      Reviewed-by: Frank van der Linden <fllinden@amazon.com>
      Tested-by: Olga kornievskaia <kolga@netapp.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • exec: Transform exec_update_mutex into a rw_semaphore · f7cfd871
      Eric W. Biederman committed
      Recently syzbot reported[0] that there is a deadlock amongst the users
      of exec_update_mutex.  The problematic lock ordering found by lockdep
      was:
      
         perf_event_open  (exec_update_mutex -> ovl_i_mutex)
         chown            (ovl_i_mutex       -> sb_writes)
         sendfile         (sb_writes         -> p->lock)
           by reading from a proc file and writing to overlayfs
         proc_pid_syscall (p->lock           -> exec_update_mutex)
      
      While looking at possible solutions, it occurred to me that all of the
      users and potential users involved only wanted the state of the given
      process to remain the same.  They are all readers.  The only writer is
      exec.
      
      There is no reason for readers to block on each other.  So fix
      this deadlock by transforming exec_update_mutex into a rw_semaphore
      named exec_update_lock that only exec takes for writing.
      
      Cc: Jann Horn <jannh@google.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christopher Yeoh <cyeoh@au1.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Fixes: eea96732 ("exec: Add exec_update_mutex to replace cred_guard_mutex")
      [0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
      Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
      Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • exec: Move io_uring_task_cancel after the point of no return · 9ee1206d
      Eric W. Biederman committed
      Now that unshare_files happens in begin_new_exec after the point of no
      return, io_uring_task_cancel can also happen later.
      
      Effectively this means io_uring activities for a task are only canceled
      when exec succeeds.
      
      Link: https://lkml.kernel.org/r/878saih2op.fsf@x220.int.ebiederm.org
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • coredump: Document coredump code exclusively used by cell spufs · c39ab6de
      Eric W. Biederman committed
      Oleg Nesterov recently asked[1] why there is an unshare_files in
      do_coredump.  After digging through all of the callers of lookup_fd,
      it turns out that
      arch/powerpc/platforms/cell/spufs/coredump.c:coredump_next_context
      is what needs the unshare_files in do_coredump.
      
      Looking at the history[2] this code was also the only piece of coredump code
      that required the unshare_files when the unshare_files was added.
      
      Looking at that code it turns out that cell is also the only
      architecture that implements elf_coredump_extra_notes_size and
      elf_coredump_extra_notes_write.
      
      Looking at the gdb repo[3], support for cell was removed[4] in binutils
      2.34.  Geoff Levand reports he is still getting questions about how to
      run modern kernels on the PS3 from people using 3rd-party firmware, so
      this code is not dead.  According to Wikipedia, the last PS3 shipped in
      Japan sometime in 2017, so it will probably be a little while before
      everyone's hardware dies.
      
      Add some comments briefly documenting the coredump code that exists
      only to support cell spufs, to make the coredump code easier to
      understand.  Eventually the hardware will be dead, or there won't be
      userspace tools, or the coredump code will be refactored and it will
      be too difficult to update a dead architecture; these comments make
      it easy to tell what to pull out to remove cell spufs support.
      
      [1] https://lkml.kernel.org/r/20201123175052.GA20279@redhat.com
      [2] 179e037f ("do_coredump(): make sure that descriptor table isn't shared")
      [3] git://sourceware.org/git/binutils-gdb.git
      [4] abf516c6931a ("Remove Cell Broadband Engine debugging support").
      Link: https://lkml.kernel.org/r/87h7pdnlzv.fsf_-_@x220.int.ebiederm.org
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>