1. 22 2月, 2020 4 次提交
    • X
      alinux: doc: Add Documentation/alibaba/interfaces.rst · 37cf1452
      Xunlei Pang 提交于
      This file collects all the interfaces specific to Alibaba Cloud Kernel.
      
      Add "memory.wmark_min_adj", "memory.exstat", and "zombie memcgs reaper"
      descriptions.
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      37cf1452
    • X
      alinux: memcg: Account throttled time due to memory.wmark_min_adj · 6221288e
      Xunlei Pang 提交于
      Accessing original memory.stat turned out to be one heavy
      operation which has been caused many real product problems.
      
      Introduce new cgroup memory.exstat, memory.exstat stands
      for "extra/extended memory.stat", which contains dedicated
      statistics from Alibaba Clould Kernel.
      
      memory.exstat is supposed to provide hierarchical statistics.
      
      Export its first "wmark_min_throttled_ms", and will add more
      like direct reclaim, direct compaction, etc.
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      6221288e
    • X
      alinux: memcg: Introduce memory.wmark_min_adj · 6bc07d25
      Xunlei Pang 提交于
      In co-location environment, there are more or less some memory
      overcommitment, then BATCH tasks may break the shared global min
      watermark resulting in all types of applications falling into
      the direct reclaim slow path hurting the RT of LS tasks.
      (NOTE: BATCH tasks tolerate big latency spike even in seconds
      as long as doesn't hurt its overal throughput. While LS tasks
      are very Latency-Sensitive, they may time out or fail in case
      of sudden latency spike lasts like hundreds of ms typically.)
      
      Actually BATCH tasks are not sensitive to memory latency, they
      can be assigned a strict min watermark which is different from
      that of LS tasks(which can be aissgned a lenient min watermark
      accordingly), thus isolating each other in case of global memory
      allocation. This is kind of like the idea behind ALLOC_HARDER
      for rt_task(), see gfp_to_alloc_flags().
      
      memory.wmark_min_adj stands for memcg global WMARK_MIN adjustment,
      it is used to realize separate min watermarks above-mentioned for
      memcgs, its valid value is within [-25, 50], specifically:
      negative value means to be relative to [0, WMARK_MIN],
      positive value means to be relative to [WMARK_MIN, WMARK_LOW].
      For examples,
        -25 means "WMARK_MIN + (WMARK_MIN - 0) * (-25%)"
         50 means "WMARK_MIN + (WMARK_LOW - WMARK_MIN) * 50%"
      
      Note that the minimum -25 is what ALLOC_HARDER uses which is safe
      for us to adopt, and the maximum 50 is one experienced value.
      
      Negative memory.wmark_min_adj means high QoS requirements, it can
      allocate below the global WMARK_MIN, which is kind of like the idea
      behind ALLOC_HARDER, see gfp_to_alloc_flags().
      
      Positive memory.wmark_min_adj means low QoS requirements, thus when
      allocation broke memcg min watermark, it should trigger direct reclaim
      traditionally, and we trigger throttle instead to further prevent
      them from disturbing others.
      
      With this interface, we can assign positive values for BATCH memcgs
      and negative values for LS memcgs.
      
      memory.wmark_min_adj default value is 0, and inherit from its parent,
      Note that the final effective wmark_min_adj will consider all the
      hierarchical values, its value is the maximal(most conservative)
      wmark_min_adj along the hierarchy but excluding intermediate default
      values(zero).
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      6bc07d25
    • X
      alinux: memcg: Provide users the ability to reap zombie memcgs · 867b4772
      Xunlei Pang 提交于
      After memcg was deleted, page caches still reference to this memcg
      causing large number of dead(zombie) memcgs in the system. Then it
      slows down access to "/sys/fs/cgroup/cpu/memory.stat", etc due to
      tons of iterations, further causing various latencies.
      
      This patch introduces two ways to reclaim these zombie memcgs.
      1) Background kthread reaper
      Introduce a kernel thread "memcg_zombie_reaper" to reclaim zombie
      memcgs at background periodically.
      
      Several knobs are also added to control the reaper scan frequency:
      - /sys/kernel/mm/memcg_reaper/scan_interval
        The scan period in second. Default 5s.
      - /sys/kernel/mm/memcg_reaper/pages_scan
        The scan rate of pages per scan. Default 1310720(5GiB for 4KiB page).
      - /sys/kernel/mm/memcg_reaper/verbose
        Output some zombie memcg information for debug purpose. Default off.
      - /sys/kernel/mm/memcg_reaper/reap_background
        "on/off" switch. Default "0" means off. Write "1" to switch it on.
      
      2) One-shot trigger by users
      - /sys/kernel/mm/memcg_reaper/reap
        Write "1" to trigger one round of zombie memcg reaping, but without
        any guarantee, you may need to launch multiple rounds as needed.
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      867b4772
  2. 19 2月, 2020 1 次提交
  3. 18 2月, 2020 1 次提交
  4. 17 2月, 2020 16 次提交
  5. 13 2月, 2020 3 次提交
  6. 11 2月, 2020 15 次提交
    • J
      io_uring: ensure registered buffer import returns the IO length · 536f70b4
      Jens Axboe 提交于
      commit 5e559561a8d7e6d4adfce6aa8fbf3daa3dec1577 upstream.
      
      A test case was reported where two linked reads with registered buffers
      failed the second link always. This is because we set the expected value
      of a request in req->result, and if we don't get this result, then we
      fail the dependent links. For some reason the registered buffer import
      returned -ERROR/0, while the normal import returns -ERROR/length. This
      broke linked commands with registered buffers.
      
      Fix this by making io_import_fixed() correctly return the mapped length.
      
      Cc: stable@vger.kernel.org # v5.3
      Reported-by: N李通洲 <carter.li@eoitek.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      536f70b4
    • P
      io_uring: Fix getting file for timeout · 0dfb6df9
      Pavel Begunkov 提交于
      commit 5683e5406e94ae1bfb0d9516a18fdb281d0f8d1d upstream.
      
      For timeout requests io_uring tries to grab a file with specified fd,
      which is usually stdin/fd=0.
      Update io_op_needs_file()
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      0dfb6df9
    • J
      io_uring: check for validity of ->rings in teardown · 7e5cf523
      Jens Axboe 提交于
      commit 15dff286d0e0087d4dcd7049911f179e4e4cfd94 upstream.
      
      Normally the rings are always valid, the exception is if we failed to
      allocate the rings at setup time. syzbot reports this:
      
      RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229
      RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d
      RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
      R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 1 PID: 8903 Comm: syz-executor410 Not tainted 5.4.0-rc7-next-20191113
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
      RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline]
      RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592
      Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1
      ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c
      89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61
      RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006
      RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a
      RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100
      RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d
      R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0
      R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000
      FS:  0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
        io_cqring_overflow_flush+0x6b9/0xa90 fs/io_uring.c:673
        io_ring_ctx_wait_and_kill+0x24f/0x7c0 fs/io_uring.c:4260
        io_uring_create fs/io_uring.c:4600 [inline]
        io_uring_setup+0x1256/0x1cc0 fs/io_uring.c:4626
        __do_sys_io_uring_setup fs/io_uring.c:4639 [inline]
        __se_sys_io_uring_setup fs/io_uring.c:4636 [inline]
        __x64_sys_io_uring_setup+0x54/0x80 fs/io_uring.c:4636
        do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x441229
      Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7
      48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
      ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229
      RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d
      RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
      R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
      Modules linked in:
      ---[ end trace b0f5b127a57f623f ]---
      RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
      RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline]
      RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592
      Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1
      ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c
      89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61
      RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006
      RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a
      RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100
      RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d
      R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0
      R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000
      FS:  0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      which is exactly the case of failing to allocate the SQ/CQ rings, and
      then entering shutdown. Check if the rings are valid before trying to
      access them at shutdown time.
      
      Reported-by: syzbot+21147d79607d724bd6f3@syzkaller.appspotmail.com
      Fixes: 1d7bb1d50fb4 ("io_uring: add support for backlogged CQ ring")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      7e5cf523
    • J
      io_uring: fix potential deadlock in io_poll_wake() · 21c5c08c
      Jens Axboe 提交于
      commit 7c9e7f0fe0d8abf856a957c150c48778e75154c1 upstream.
      
      We attempt to run the poll completion inline, but we're using trylock to
      do so. This avoids a deadlock since we're grabbing the locks in reverse
      order at this point, we already hold the poll wq lock and we're trying
      to grab the completion lock, while the normal rules are the reverse of
      that order.
      
      IO completion for a timeout link will need to grab the completion lock,
      but that's not safe from this context. Put the completion under the
      completion_lock in io_poll_wake(), and mark the request as entering
      the completion with the completion_lock already held.
      
      Fixes: 2665abfd757f ("io_uring: add support for linked SQE timeouts")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      21c5c08c
    • J
      io_uring: use correct "is IO worker" helper · e60d62d0
      Jens Axboe 提交于
      commit 960e432dfa5927892a9b170d14de874597b84849 upstream.
      
      Since we switched to io-wq, the dependent link optimization for when to
      pass back work inline has been broken. Fix this by providing a suitable
      io-wq helper for io_uring to use to detect when to do this.
      
      Fixes: 561fb04a6a22 ("io_uring: replace workqueue usage with io-wq")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      e60d62d0
    • J
      io_uring: make timeout sequence == 0 mean no sequence · b34219bd
      Jens Axboe 提交于
      commit 93bd25bb69f46367ba8f82c578e0c05702ceb482 upstream.
      
      Currently we make sequence == 0 be the same as sequence == 1, but that's
      not super useful if the intent is really to have a timeout that's just
      a pure timeout.
      
      If the user passes in sqe->off == 0, then don't apply any sequence logic
      to the request, let it purely be driven by the timeout specified.
      Reported-by: N李通洲 <carter.li@eoitek.com>
      Reviewed-by: N李通洲 <carter.li@eoitek.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      b34219bd
    • J
      io_uring: fix -ENOENT issue with linked timer with short timeout · a368a2eb
      Jens Axboe 提交于
      commit 76a46e066e2d93bd333599d1c84c605c2c4cc909 upstream.
      
      If you prep a read (for example) that needs to get punted to async
      context with a timer, if the timeout is sufficiently short, the timer
      request will get completed with -ENOENT as it could not find the read.
      
      The issue is that we prep and start the timer before we start the read.
      Hence the timer can trigger before the read is even started, and the end
      result is then that the timer completes with -ENOENT, while the read
      starts instead of being cancelled by the timer.
      
      Fix this by splitting the linked timer into two parts:
      
      1) Prep and validate the linked timer
      2) Start timer
      
      The read is then started between steps 1 and 2, so we know that the
      timer will always have a consistent view of the read request state.
      Reported-by: NHrvoje Zeba <zeba.hrvoje@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      a368a2eb
    • J
      io_uring: don't do flush cancel under inflight_lock · 6ee293df
      Jens Axboe 提交于
      commit 768134d4f48109b90f4248feecbeeb7d684e410c upstream.
      
      We can't safely cancel under the inflight lock. If the work hasn't been
      started yet, then io_wq_cancel_work() simply marks the work as cancelled
      and invokes the work handler. But if the work completion needs to grab
      the inflight lock because it's grabbing user files, then we'll deadlock
      trying to finish the work as we already hold that lock.
      
      Instead grab a reference to the request, if it isn't already zero. If
      it's zero, then we know it's going through completion anyway, and we
      can safely ignore it. If it's not zero, then we can drop the lock and
      attempt to cancel from there.
      
      This also fixes a missing finish_wait() at the end of
      io_uring_cancel_files().
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      6ee293df
    • J
      io_uring: flag SQPOLL busy condition to userspace · d55737c6
      Jens Axboe 提交于
      commit c1edbf5f081be9fbbea68c1d564b773e59c1acf3 upstream.
      
      Now that we have backpressure, for SQPOLL, we have one more condition
      that warrants flagging that the application needs to enter the kernel:
      we failed to submit IO due to backpressure. Make sure we catch that
      and flag it appropriately.
      
      If we run into backpressure issues with the SQPOLL thread, flag it
      as such to the application by setting IORING_SQ_NEED_WAKEUP. This will
      cause the application to enter the kernel, and that will flush the
      backlog and clear the condition.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      d55737c6
    • J
      io_uring: make ASYNC_CANCEL work with poll and timeout · d54f2825
      Jens Axboe 提交于
      commit 47f467686ec02fc07fd5c6bb34b6f6736e2884b0 upstream.
      
      It's a little confusing that we have multiple types of command
      cancellation opcodes now that we have a generic one. Make the generic
      one work with POLL_ADD and TIMEOUT commands as well, that makes for an
      easier to use API for the application. The fact that they currently
      don't is a bit confusing.
      
      Add a helper that takes care of it, so we can user it from both
      IORING_OP_ASYNC_CANCEL and from the linked timeout cancellation.
      Reported-by: NHrvoje Zeba <zeba.hrvoje@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      d54f2825
    • J
      io_uring: provide fallback request for OOM situations · ebb9eff5
      Jens Axboe 提交于
      commit 0ddf92e848ab7abf216f218ee363eb9b9650e98f upstream.
      
      One thing that really sucks for userspace APIs is if the kernel passes
      back -ENOMEM/-EAGAIN for resource shortages. The application really has
      no idea of what to do in those cases. Should it try and reap
      completions? Probably a good idea. Will it solve the issue? Who knows.
      
      This patch adds a simple fallback mechanism if we fail to allocate
      memory for a request. If we fail allocating memory from the slab for a
      request, we punt to a pre-allocated request. There's just one of these
      per io_ring_ctx, but the important part is if we ever return -EBUSY to
      the application, the applications knows that it can wait for events and
      make forward progress when events have completed. This is the important
      part.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      ebb9eff5
    • J
      io_uring: convert accept4() -ERESTARTSYS into -EINTR · ca3e3b02
      Jens Axboe 提交于
      commit 8e3cca12706231daf8daf90dbde59f1665135e48 upstream.
      
      If we cancel a pending accept operating with a signal, we get
      -ERESTARTSYS returned. Turn that into -EINTR for userspace, we should
      not be return -ERESTARTSYS.
      
      Fixes: 17f2fe35d080 ("io_uring: add support for IORING_OP_ACCEPT")
      Reported-by: NHrvoje Zeba <zeba.hrvoje@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      ca3e3b02
    • J
      io_uring: fix error clear of ->file_table in io_sqe_files_register() · 7eeb7853
      Jens Axboe 提交于
      commit 46568e9be70ff8211d986685f08d919376c32998 upstream.
      
      syzbot reports that when using failslab and friends, we can get a double
      free in io_sqe_files_unregister():
      
      BUG: KASAN: double-free or invalid-free in
      io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185
      
      CPU: 1 PID: 8819 Comm: syz-executor452 Not tainted 5.4.0-rc6-next-20191108
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x197/0x210 lib/dump_stack.c:118
        print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
        kasan_report_invalid_free+0x65/0xa0 mm/kasan/report.c:468
        __kasan_slab_free+0x13a/0x150 mm/kasan/common.c:450
        kasan_slab_free+0xe/0x10 mm/kasan/common.c:480
        __cache_free mm/slab.c:3426 [inline]
        kfree+0x10a/0x2c0 mm/slab.c:3757
        io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185
        io_ring_ctx_free fs/io_uring.c:3998 [inline]
        io_ring_ctx_wait_and_kill+0x348/0x700 fs/io_uring.c:4060
        io_uring_release+0x42/0x50 fs/io_uring.c:4068
        __fput+0x2ff/0x890 fs/file_table.c:280
        ____fput+0x16/0x20 fs/file_table.c:313
        task_work_run+0x145/0x1c0 kernel/task_work.c:113
        exit_task_work include/linux/task_work.h:22 [inline]
        do_exit+0x904/0x2e60 kernel/exit.c:817
        do_group_exit+0x135/0x360 kernel/exit.c:921
        __do_sys_exit_group kernel/exit.c:932 [inline]
        __se_sys_exit_group kernel/exit.c:930 [inline]
        __x64_sys_exit_group+0x44/0x50 kernel/exit.c:930
        do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x43f2c8
      Code: 31 b8 c5 f7 ff ff 48 8b 5c 24 28 48 8b 6c 24 30 4c 8b 64 24 38 4c 8b
      6c 24 40 4c 8b 74 24 48 4c 8b 7c 24 50 48 83 c4 58 c3 66 <0f> 1f 84 00 00
      00 00 00 48 8d 35 59 ca 00 00 0f b6 d2 48 89 fb 48
      RSP: 002b:00007ffd5b976008 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000043f2c8
      RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
      RBP: 00000000004bf0a8 R08: 00000000000000e7 R09: ffffffffffffffd0
      R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001
      R13: 00000000006d1180 R14: 0000000000000000 R15: 0000000000000000
      
      This happens if we fail allocating the file tables. For that case we do
      free the file table correctly, but we forget to set it to NULL. This
      means that ring teardown will see it as being non-NULL, and attempt to
      free it again.
      
      Fix this by clearing the file_table pointer if we free the table.
      
      Reported-by: syzbot+3254bc44113ae1e331ee@syzkaller.appspotmail.com
      Fixes: 65e19f54d29c ("io_uring: support for larger fixed file sets")
      Reviewed-by: NBob Liu <bob.liu@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      7eeb7853
    • J
      io_uring: separate the io_free_req and io_free_req_find_next interface · e247932d
      Jackie Liu 提交于
      commit c69f8dbe2426cbf6150407b7e86ce85bb463c1dc upstream.
      
      Similar to the distinction between io_put_req and io_put_req_find_next,
      io_free_req has been modified similarly, with no functional changes.
      Signed-off-by: NJackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      e247932d
    • J
      io_uring: keep io_put_req only responsible for release and put req · 5a624cf8
      Jackie Liu 提交于
      commit ec9c02ad4c3808d6d9ed28ad1d0485d6e2a33ac5 upstream.
      
      We already have io_put_req_find_next to find the next req of the link.
      we should not use the io_put_req function to find them. They should be
      functions of the same level.
      Signed-off-by: NJackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      5a624cf8