1. 22 Feb, 2021 6 commits
  2. 13 Feb, 2021 1 commit
    • io-wq: clear out worker ->fs and ->files · e06aa2e9
      Committed by Jens Axboe
      By default, kernel threads have init_fs and init_files assigned. In the
      past, this has triggered security problems, as commands that don't ask
      for (and hence don't get assigned) fs/files from the originating task
      can then attempt path resolution etc with access to parts of the system
      they should not be able to.
      
      Rather than add checks in the fs code for misuse, just set these to
      NULL. If we do attempt to use them, then the resulting code will oops
      rather than provide access to something that it should not permit.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 04 Feb, 2021 1 commit
  4. 02 Feb, 2021 1 commit
  5. 21 Dec, 2020 1 commit
  6. 10 Dec, 2020 1 commit
  7. 05 Nov, 2020 1 commit
  8. 22 Oct, 2020 1 commit
  9. 21 Oct, 2020 1 commit
  10. 17 Oct, 2020 5 commits
  11. 01 Oct, 2020 5 commits
    • io-wq: kill unused IO_WORKER_F_EXITING · 145cc8c6
      Committed by Jens Axboe
      This flag is no longer used, remove it.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: fix use-after-free in io_wq_worker_running · c4068bf8
      Committed by Hillf Danton
      The smart syzbot has found a reproducer for the following issue:
      
       ==================================================================
       BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline]
       BUG: KASAN: use-after-free in atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
       BUG: KASAN: use-after-free in io_wqe_inc_running fs/io-wq.c:301 [inline]
       BUG: KASAN: use-after-free in io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
       Write of size 4 at addr ffff8882183db08c by task io_wqe_worker-0/7771
      
       CPU: 0 PID: 7771 Comm: io_wqe_worker-0 Not tainted 5.9.0-rc4-syzkaller #0
       Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
       Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x198/0x1fd lib/dump_stack.c:118
        print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
        __kasan_report mm/kasan/report.c:513 [inline]
        kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
        check_memory_region_inline mm/kasan/generic.c:186 [inline]
        check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
        instrument_atomic_write include/linux/instrumented.h:71 [inline]
        atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
        io_wqe_inc_running fs/io-wq.c:301 [inline]
        io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
        schedule_timeout+0x148/0x250 kernel/time/timer.c:1879
        io_wqe_worker+0x517/0x10e0 fs/io-wq.c:580
        kthread+0x3b5/0x4a0 kernel/kthread.c:292
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      
       Allocated by task 7768:
        kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
        kasan_set_track mm/kasan/common.c:56 [inline]
        __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461
        kmem_cache_alloc_node_trace+0x17b/0x3f0 mm/slab.c:3594
        kmalloc_node include/linux/slab.h:572 [inline]
        kzalloc_node include/linux/slab.h:677 [inline]
        io_wq_create+0x57b/0xa10 fs/io-wq.c:1064
        io_init_wq_offload fs/io_uring.c:7432 [inline]
        io_sq_offload_start fs/io_uring.c:7504 [inline]
        io_uring_create fs/io_uring.c:8625 [inline]
        io_uring_setup+0x1836/0x28e0 fs/io_uring.c:8694
        do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       Freed by task 21:
        kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
        kasan_set_track+0x1c/0x30 mm/kasan/common.c:56
        kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355
        __kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422
        __cache_free mm/slab.c:3418 [inline]
        kfree+0x10e/0x2b0 mm/slab.c:3756
        __io_wq_destroy fs/io-wq.c:1138 [inline]
        io_wq_destroy+0x2af/0x460 fs/io-wq.c:1146
        io_finish_async fs/io_uring.c:6836 [inline]
        io_ring_ctx_free fs/io_uring.c:7870 [inline]
        io_ring_exit_work+0x1e4/0x6d0 fs/io_uring.c:7954
        process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
        worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
        kthread+0x3b5/0x4a0 kernel/kthread.c:292
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      
       The buggy address belongs to the object at ffff8882183db000
        which belongs to the cache kmalloc-1k of size 1024
       The buggy address is located 140 bytes inside of
        1024-byte region [ffff8882183db000, ffff8882183db400)
       The buggy address belongs to the page:
       page:000000009bada22b refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2183db
       flags: 0x57ffe0000000200(slab)
       raw: 057ffe0000000200 ffffea0008604c48 ffffea00086a8648 ffff8880aa040700
       raw: 0000000000000000 ffff8882183db000 0000000100000002 0000000000000000
       page dumped because: kasan: bad access detected
      
       Memory state around the buggy address:
        ffff8882183daf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        ffff8882183db000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       >ffff8882183db080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                             ^
        ffff8882183db100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff8882183db180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ==================================================================
      
      which comes down to the check below,
      
      	/* all workers gone, wq exit can proceed */
      	if (!nr_workers && refcount_dec_and_test(&wqe->wq->refs))
      		complete(&wqe->wq->done);
      
      because a wq may contain multiple wqe instances, and we must wait for
      every worker in every wqe to exit before releasing the wq's resources
      on destroy.

      To that end, rework the wq's refcount by making it independent of the
      tracking of workers, since they are two different things, and keep it
      balanced as workers come and go. Note that the manager kthread, like
      other workers, now holds a reference to the wq during its lifetime.

      Finally, to help destroy the wq, check IO_WQ_BIT_EXIT when creating a
      worker and do nothing if the wq is exiting.
      
      Cc: stable@vger.kernel.org # v5.5+
      Reported-by: syzbot+45fa0a195b941764e0f0@syzkaller.appspotmail.com
      Reported-by: syzbot+9af99580130003da82b1@syzkaller.appspotmail.com
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Hillf Danton <hdanton@sina.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
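The refcount rework described in this commit can be sketched in userspace C. This is a minimal illustration under stated assumptions, not the kernel patch: `struct wq`, `wq_create`, `wq_put`, `wq_worker_start`, `wq_worker_exit` and `wq_destroy` are hypothetical names, the `exiting` flag only stands in for IO_WQ_BIT_EXIT, and the sketch is meant to be driven from a single thread. The point it shows is that the object's lifetime depends only on the reference count, while the worker count is plain bookkeeping.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the wq; not the kernel structure. */
struct wq {
    atomic_int refs;       /* lifetime references: creator plus one per worker */
    atomic_int nr_workers; /* bookkeeping only, never decides when to free */
    atomic_bool exiting;   /* plays the role of IO_WQ_BIT_EXIT */
    bool *freed;           /* set on release so a caller can observe it */
};

static struct wq *wq_create(bool *freed)
{
    struct wq *wq = malloc(sizeof(*wq));

    if (!wq)
        return NULL;
    atomic_init(&wq->refs, 1); /* the creator's reference */
    atomic_init(&wq->nr_workers, 0);
    atomic_init(&wq->exiting, false);
    wq->freed = freed;
    *freed = false;
    return wq;
}

/* Drop one reference; whoever drops the last one releases the wq. */
static void wq_put(struct wq *wq)
{
    int old = atomic_fetch_sub(&wq->refs, 1);

    assert(old > 0);
    if (old == 1) {
        *wq->freed = true;
        free(wq);
    }
}

/*
 * Called by someone who already holds a reference. Refuses to start a
 * worker on an exiting wq; otherwise the new worker takes its own
 * reference for its whole lifetime.
 */
static bool wq_worker_start(struct wq *wq)
{
    if (atomic_load(&wq->exiting))
        return false;
    atomic_fetch_add(&wq->refs, 1);
    atomic_fetch_add(&wq->nr_workers, 1);
    return true;
}

/* The worker leaves: update the bookkeeping, then drop its reference. */
static void wq_worker_exit(struct wq *wq)
{
    atomic_fetch_sub(&wq->nr_workers, 1);
    wq_put(wq);
}

/* Flag the wq as exiting and drop the creator's reference. */
static void wq_destroy(struct wq *wq)
{
    atomic_store(&wq->exiting, true);
    wq_put(wq);
}
```

With two workers started, calling `wq_destroy()` while one worker is still alive does not free the object; the memory is released only when that last worker calls `wq_worker_exit()`.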
    • io_uring: add blkcg accounting to offloaded operations · 91d8f519
      Committed by Dennis Zhou
      A few operations are offloaded to the worker threads. In that case we
      lose process context and end up in kthread context. As a result, the
      I/Os are not accounted to the issuing cgroup and instead appear to be
      issued by root. Just like the others, adopt the personality of the
      blkcg too when issuing via the workqueues.
      
      For the SQPOLL thread, it will live and attach in the inited cgroup's
      context.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_wq: Make io_wqe::lock a raw_spinlock_t · 95da8465
      Committed by Sebastian Andrzej Siewior
      During a context switch the scheduler invokes wq_worker_sleeping() with
      disabled preemption. Disabling preemption is needed because it protects
      access to `worker->sleeping'. As an optimisation it avoids invoking
      schedule() within the schedule path as part of possible wake up (thus
      preempt_enable_no_resched() afterwards).
      
      The io-wq has been added to the mix in the same section with disabled
      preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping()
      acquires a spinlock_t. Also within the schedule() the spinlock_t must be
      acquired after tsk_is_pi_blocked() otherwise it will block on the
      sleeping lock again while scheduling out.
      
      While playing with `io_uring-bench' I didn't notice a significant
      latency spike after converting io_wqe::lock to a raw_spinlock_t. The
      latency was more or less the same.
      
      In order to keep the spinlock_t it would have to be moved after the
      tsk_is_pi_blocked() check which would introduce a branch instruction
      into the hot path.
      
      The lock is used to maintain the `work_list' and wakes one task up at
      most.
      Should io_wqe_cancel_pending_work() cause latency spikes, while
      searching for a specific item, then it would need to drop the lock
      during iterations.
      revert_creds() is also invoked under the lock. According to debug
      cred::non_rcu is 0. Otherwise it should be moved outside of the locked
      section because put_cred_rcu()->free_uid() acquires a sleeping lock.
      
      Convert io_wqe::lock to a raw_spinlock_t.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: reference ->nsproxy for file table commands · 9b828492
      Committed by Jens Axboe
      If we don't get and assign the namespace for the async work, then certain
      paths just don't work properly (like /dev/stdin, /proc/mounts, etc).
      Anything that references the current namespace of the given task should
      be assigned for async work on behalf of that task.
      
      Cc: stable@vger.kernel.org # v5.5+
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 24 Aug, 2020 1 commit
  13. 25 Jul, 2020 2 commits
  14. 27 Jun, 2020 1 commit
  15. 15 Jun, 2020 3 commits
  16. 11 Jun, 2020 3 commits
  17. 09 Jun, 2020 1 commit
  18. 04 Apr, 2020 1 commit
  19. 24 Mar, 2020 1 commit
    • io-wq: handle hashed writes in chains · 86f3cd1b
      Committed by Pavel Begunkov
      We always punt async buffered writes to an io-wq helper, as the core
      kernel does not have IOCB_NOWAIT support for that. Most buffered async
      writes complete very quickly, as it's just a copy operation. This means
      that doing multiple locking roundtrips on the shared wqe lock for each
      buffered write is wasteful. Additionally, buffered writes are hashed
      work items, which means that any buffered write to a given file is
      serialized.
      
      Keep identically hashed work items contiguously in @wqe->work_list, and
      track a tail for each hash bucket. On dequeue of a hashed item, splice
      all of the same hash in one go using the tracked tail. Until the batch
      is done, the caller doesn't have to synchronize with the wqe or worker
      locks again.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
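The list discipline this commit describes (identically hashed items kept adjacent, one tracked tail per hash bucket, the whole run spliced off on dequeue) can be sketched in userspace C. This is a simplified illustration, not the kernel implementation: `struct work`, `struct work_list`, `NR_HASH`, `NO_HASH`, `enqueue` and `dequeue` are hypothetical names, locking is omitted, and `dequeue` always takes the head of the list rather than skipping hashes that are already being processed.

```c
#include <assert.h>
#include <stddef.h>

#define NR_HASH 8     /* hypothetical number of hash buckets */
#define NO_HASH (-1)  /* marks an unhashed work item */

struct work {
    struct work *next;
    int hash;         /* NO_HASH, or a bucket index in 0..NR_HASH-1 */
    int id;
};

struct work_list {
    struct work *head, *tail;
    struct work *hash_tail[NR_HASH]; /* last queued item per hash, or NULL */
};

/*
 * Queue a work item. A hashed item goes directly after the tracked tail
 * of its hash, so items with the same hash always form one contiguous
 * run. Everything else is appended at the end of the list.
 */
static void enqueue(struct work_list *l, struct work *w)
{
    struct work *prev = NULL;

    assert(w->hash == NO_HASH || (w->hash >= 0 && w->hash < NR_HASH));
    if (w->hash != NO_HASH) {
        prev = l->hash_tail[w->hash];
        l->hash_tail[w->hash] = w;
    }
    if (prev) {
        /* Extend the existing run of this hash. */
        w->next = prev->next;
        prev->next = w;
        if (l->tail == prev)
            l->tail = w;
    } else {
        /* Plain append. */
        w->next = NULL;
        if (l->tail)
            l->tail->next = w;
        else
            l->head = w;
        l->tail = w;
    }
}

/*
 * Detach the head item. If it is hashed, detach the whole run of that
 * hash in one go using the tracked tail. Returns a NULL-terminated chain
 * that the caller can process without touching the list again.
 */
static struct work *dequeue(struct work_list *l)
{
    struct work *first = l->head;
    struct work *last = first;

    if (!first)
        return NULL;
    if (first->hash != NO_HASH) {
        last = l->hash_tail[first->hash];
        l->hash_tail[first->hash] = NULL;
    }
    l->head = last->next;
    if (!l->head)
        l->tail = NULL;
    last->next = NULL;
    return first;
}
```

Because the run is cut out with two pointer updates, the cost of a dequeue does not grow with the number of same-hash items, which is what lets a batch of buffered writes to one file be handed over in a single step.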
  20. 23 Mar, 2020 1 commit
  21. 15 Mar, 2020 2 commits