1. 10 May 2023, 14 commits
    • net: sched: sch_qfq: prevent slab-out-of-bounds in qfq_activate_agg · 86d304ad
      Committed by Gwangun Jung
      stable inclusion
      from stable-v5.10.179
      commit ddcf35deb8f2a1d9addc74b586cf4c5a1f5d6020
      category: bugfix
      bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6ZISA
      CVE: CVE-2023-31436
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=ddcf35deb8f2a1d9addc74b586cf4c5a1f5d6020
      
      --------------------------------
      
      [ Upstream commit 30379334 ]
      
      If the TCA_QFQ_LMAX value is not offered through nlattr, lmax is determined by the MTU value of the network device.
      The MTU of the loopback device can be set up to 2^31-1.
      As a result, it is possible to have an lmax value that exceeds QFQ_MIN_LMAX.
      
      Due to the invalid lmax value, an index is generated that exceeds the QFQ_MAX_INDEX(=24) value, causing out-of-bounds read/write errors.
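      The bounds involved can be sketched in user-space C. This is a minimal
      illustration, not the upstream patch: qfq_check_lmax() is a hypothetical
      helper, and the two constants are assumed to match the values defined in
      net/sched/sch_qfq.c for this tree.

      ```c
      #include <errno.h>
      #include <stdint.h>

      /* Assumed to mirror the definitions in net/sched/sch_qfq.c. */
      #define QFQ_MTU_SHIFT 16   /* log2 of the largest accepted lmax */
      #define QFQ_MIN_LMAX  512  /* smallest accepted lmax, in bytes */

      /*
       * Hypothetical helper: validate a class's maximum packet length,
       * whether it came from TCA_QFQ_LMAX or from the device MTU.
       * Returns 0 when lmax is usable and -EINVAL otherwise.
       */
      static int qfq_check_lmax(uint32_t lmax)
      {
          if (lmax < QFQ_MIN_LMAX || lmax > (1UL << QFQ_MTU_SHIFT))
              return -EINVAL;
          return 0;
      }
      ```

      With such a check, an lmax derived from a loopback MTU of 2^31-1 is
      rejected before it can be turned into a group index.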
      
      The following KASAN report shows the out-of-bounds access:
      
      [   84.582666] BUG: KASAN: slab-out-of-bounds in qfq_activate_agg.constprop.0 (net/sched/sch_qfq.c:1027 net/sched/sch_qfq.c:1060 net/sched/sch_qfq.c:1313)
      [   84.583267] Read of size 4 at addr ffff88810f676948 by task ping/301
      [   84.583686]
      [   84.583797] CPU: 3 PID: 301 Comm: ping Not tainted 6.3.0-rc5 #1
      [   84.584164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
      [   84.584644] Call Trace:
      [   84.584787]  <TASK>
      [   84.584906] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
      [   84.585108] print_report (mm/kasan/report.c:320 mm/kasan/report.c:430)
      [   84.585570] kasan_report (mm/kasan/report.c:538)
      [   84.585988] qfq_activate_agg.constprop.0 (net/sched/sch_qfq.c:1027 net/sched/sch_qfq.c:1060 net/sched/sch_qfq.c:1313)
      [   84.586599] qfq_enqueue (net/sched/sch_qfq.c:1255)
      [   84.587607] dev_qdisc_enqueue (net/core/dev.c:3776)
      [   84.587749] __dev_queue_xmit (./include/net/sch_generic.h:186 net/core/dev.c:3865 net/core/dev.c:4212)
      [   84.588763] ip_finish_output2 (./include/net/neighbour.h:546 net/ipv4/ip_output.c:228)
      [   84.589460] ip_output (net/ipv4/ip_output.c:430)
      [   84.590132] ip_push_pending_frames (./include/net/dst.h:444 net/ipv4/ip_output.c:126 net/ipv4/ip_output.c:1586 net/ipv4/ip_output.c:1606)
      [   84.590285] raw_sendmsg (net/ipv4/raw.c:649)
      [   84.591960] sock_sendmsg (net/socket.c:724 net/socket.c:747)
      [   84.592084] __sys_sendto (net/socket.c:2142)
      [   84.593306] __x64_sys_sendto (net/socket.c:2150)
      [   84.593779] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
      [   84.593902] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
      [   84.594070] RIP: 0033:0x7fe568032066
      [   84.594192] Code: 0e 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c09
      [   84.594796] RSP: 002b:00007ffce388b4e8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      [   84.595047] RAX: ffffffffffffffda RBX: 00007ffce388cc70 RCX: 00007fe568032066
      [   84.595281] RDX: 0000000000000040 RSI: 00005605fdad6d10 RDI: 0000000000000003
      [   84.595515] RBP: 00005605fdad6d10 R08: 00007ffce388eeec R09: 0000000000000010
      [   84.595749] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040
      [   84.595984] R13: 00007ffce388cc30 R14: 00007ffce388b4f0 R15: 0000001d00000001
      [   84.596218]  </TASK>
      [   84.596295]
      [   84.596351] Allocated by task 291:
      [   84.596467] kasan_save_stack (mm/kasan/common.c:46)
      [   84.596597] kasan_set_track (mm/kasan/common.c:52)
      [   84.596725] __kasan_kmalloc (mm/kasan/common.c:384)
      [   84.596852] __kmalloc_node (./include/linux/kasan.h:196 mm/slab_common.c:967 mm/slab_common.c:974)
      [   84.596979] qdisc_alloc (./include/linux/slab.h:610 ./include/linux/slab.h:731 net/sched/sch_generic.c:938)
      [   84.597100] qdisc_create (net/sched/sch_api.c:1244)
      [   84.597222] tc_modify_qdisc (net/sched/sch_api.c:1680)
      [   84.597357] rtnetlink_rcv_msg (net/core/rtnetlink.c:6174)
      [   84.597495] netlink_rcv_skb (net/netlink/af_netlink.c:2574)
      [   84.597627] netlink_unicast (net/netlink/af_netlink.c:1340 net/netlink/af_netlink.c:1365)
      [   84.597759] netlink_sendmsg (net/netlink/af_netlink.c:1942)
      [   84.597891] sock_sendmsg (net/socket.c:724 net/socket.c:747)
      [   84.598016] ____sys_sendmsg (net/socket.c:2501)
      [   84.598147] ___sys_sendmsg (net/socket.c:2557)
      [   84.598275] __sys_sendmsg (./include/linux/file.h:31 net/socket.c:2586)
      [   84.598399] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
      [   84.598520] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
      [   84.598688]
      [   84.598744] The buggy address belongs to the object at ffff88810f674000
      [   84.598744]  which belongs to the cache kmalloc-8k of size 8192
      [   84.599135] The buggy address is located 2664 bytes to the right of
      [   84.599135]  allocated 7904-byte region [ffff88810f674000, ffff88810f675ee0)
      [   84.599544]
      [   84.599598] The buggy address belongs to the physical page:
      [   84.599777] page:00000000e638567f refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10f670
      [   84.600074] head:00000000e638567f order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
      [   84.600330] flags: 0x200000000010200(slab|head|node=0|zone=2)
      [   84.600517] raw: 0200000000010200 ffff888100043180 dead000000000122 0000000000000000
      [   84.600764] raw: 0000000000000000 0000000080020002 00000001ffffffff 0000000000000000
      [   84.601009] page dumped because: kasan: bad access detected
      [   84.601187]
      [   84.601241] Memory state around the buggy address:
      [   84.601396]  ffff88810f676800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [   84.601620]  ffff88810f676880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [   84.601845] >ffff88810f676900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [   84.602069]                                               ^
      [   84.602243]  ffff88810f676980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [   84.602468]  ffff88810f676a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [   84.602693] ==================================================================
      [   84.602924] Disabling lock debugging due to kernel taint
      
      Fixes: 3015f3d2 ("pkt_sched: enable QFQ to support TSO/GSO")
      Reported-by: Gwangun Jung <exsociety@gmail.com>
      Signed-off-by: Gwangun Jung <exsociety@gmail.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
      Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
    • i2c: xgene-slimpro: Fix out-of-bounds bug in xgene_slimpro_i2c_xfer() · 86923497
      Committed by Wei Chen
      mainline inclusion
      from mainline-v6.3-rc4
      commit 92fbb6d1
      category: bugfix
      bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6XHPL
      CVE: CVE-2023-2194
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=92fbb6d1296f81f41f65effd7f5f8c0f74943d15
      
      --------------------------------
      
      The data->block[0] variable comes from user space and is a number in the
      range 0-255. Without a proper check, the value can be large enough to
      cause an out-of-bounds access when slimpro_i2c_blkwr() performs its memcpy().
      
      Fix this bug by checking the value of writelen.
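      A user-space sketch of that length check follows. The struct blk_ctx type
      and blkwr_checked() are hypothetical stand-ins for the driver's context
      and write helper; the limit I2C_SMBUS_BLOCK_MAX (32) is assumed from
      include/uapi/linux/i2c.h.

      ```c
      #include <errno.h>
      #include <stdint.h>
      #include <string.h>

      #define I2C_SMBUS_BLOCK_MAX 32  /* assumed, as in include/uapi/linux/i2c.h */

      /* Hypothetical stand-in for the driver's fixed-size transfer buffer. */
      struct blk_ctx {
          uint8_t dma_buffer[I2C_SMBUS_BLOCK_MAX + 1];
      };

      /* Copy a user-supplied block only if it fits in the buffer. */
      static int blkwr_checked(struct blk_ctx *ctx, const uint8_t *data,
                               uint32_t writelen)
      {
          if (writelen > I2C_SMBUS_BLOCK_MAX)
              return -EINVAL;          /* reject oversized lengths up front */
          memcpy(ctx->dma_buffer, data, writelen);
          return 0;
      }
      ```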
      
      Fixes: f6505fba ("i2c: add SLIMpro I2C device driver on APM X-Gene platform")
      Signed-off-by: Wei Chen <harperchen1110@gmail.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Andi Shyti <andi.shyti@kernel.org>
      Signed-off-by: Wolfram Sang <wsa@kernel.org>
      Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
      Reviewed-by: Zheng Yejian <zhengyejian1@huawei.com>
      Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
    • ext4: only update i_reserved_data_blocks on successful block allocation · f0af88ce
      Committed by Baokun Li
      maillist inclusion
      category: bugfix
      bugzilla: 188499, https://gitee.com/openeuler/kernel/issues/I6TNVT
      CVE: NA
      
      Reference: https://patchwork.ozlabs.org/project/linux-ext4/patch/20230412124126.2286716-2-libaokun1@huawei.com/
      
      ----------------------------------------
      
      In our fault injection test, we create an ext4 file, migrate it to
      non-extent based file, then punch a hole and finally trigger a WARN_ON
      in the ext4_da_update_reserve_space():
      
      EXT4-fs warning (device sda): ext4_da_update_reserve_space:369:
      ino 14, used 11 with only 10 reserved data blocks
      
      When writing back a non-extent based file, if we enable delalloc, the
      number of reserved blocks will be subtracted from the number of blocks
      mapped by ext4_ind_map_blocks(), and the extent status tree will be
      updated. We update the extent status tree by first removing the old
      extent_status and then inserting the new extent_status. If the block range
      we remove happens to be in an extent, then we need to allocate another
      extent_status with ext4_es_alloc_extent().
      
             use old    to remove   to add new
          |----------|------------|------------|
                    old extent_status
      
      The problem is that the allocation of a new extent_status failed due to a
      fault injection, and __es_shrink() did not get free memory, resulting in
      a return of -ENOMEM. Then do_writepages() retries after receiving -ENOMEM,
      we map to the same extent again, and the number of reserved blocks is again
      subtracted from the number of blocks in that extent. Since the blocks in
      the same extent are subtracted twice, we end up triggering WARN_ON at
      ext4_da_update_reserve_space() because used > ei->i_reserved_data_blocks.
      
      For non-extent based file, we update the number of reserved blocks after
      ext4_ind_map_blocks() is executed, which causes a problem that when we call
      ext4_ind_map_blocks() to create a block, it doesn't always create a block,
      but we always reduce the number of reserved blocks. So we move the logic
      for updating reserved blocks to ext4_ind_map_blocks() to ensure that the
      number of reserved blocks is updated only after we do succeed in allocating
      some new blocks.
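      The accounting rule can be modelled in a few lines of user-space C. This
      is a simplified sketch under stated assumptions, not ext4 code: struct
      file_model and map_range() are hypothetical, and a single flag stands in
      for "this range already has blocks on disk".

      ```c
      /* Hypothetical model of the per-inode delalloc reservation. */
      struct file_model {
          unsigned int reserved;  /* analogue of i_reserved_data_blocks */
          int mapped;             /* nonzero once the range has blocks on disk */
      };

      /*
       * Map a range of len blocks. The reservation is consumed only when new
       * blocks are really allocated, so a retry that finds the range already
       * mapped leaves the counter alone.
       */
      static unsigned int map_range(struct file_model *f, unsigned int len)
      {
          if (!f->mapped) {
              f->mapped = 1;        /* blocks allocated on this call */
              f->reserved -= len;   /* consume the reservation exactly once */
          }
          return len;               /* number of blocks now mapped */
      }
      ```

      Calling map_range() twice for the same range, as the writeback retry
      does, subtracts the reservation once instead of twice.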
      
      Fixes: 5f634d06 ("ext4: Fix quota accounting error with fallocate")
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Baokun Li <libaokun1@huawei.com>
      Reviewed-by: Yang Erkun <yangerkun@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
    • can: af_can: fix NULL pointer dereference in can_rcv_filter · a6b58c2d
      Committed by Oliver Hartkopp
      stable inclusion
      from stable-v5.10.159
      commit c42221efb1159d6a3c89e96685ee38acdce86b6f
      category: bugfix
      bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6WUDS
      CVE: CVE-2023-2166
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=c42221efb1159d6a3c89e96685ee38acdce86b6f
      
      --------------------------------
      
      commit 0acc4423 upstream.
      
      Analogous to commit 8aa59e35 ("can: af_can: fix NULL pointer
      dereference in can_rx_register()") we need to check for a missing
      initialization of ml_priv in the receive path of CAN frames.
      
      Since commit 4e096a18 ("net: introduce CAN specific pointer in the
      struct net_device") the check for dev->type to be ARPHRD_CAN is not
      sufficient anymore since bonding or tun netdevices claim to be CAN
      devices but do not initialize ml_priv accordingly.
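      The stricter receive-path condition can be illustrated with a small
      user-space model. struct netdev_model and can_rx_allowed() are
      hypothetical names; ARPHRD_CAN is assumed to be 280, as in
      include/uapi/linux/if_arp.h.

      ```c
      #include <stdbool.h>
      #include <stddef.h>

      #define ARPHRD_CAN 280  /* assumed, as in include/uapi/linux/if_arp.h */

      /* Hypothetical reduction of struct net_device to the two fields checked. */
      struct netdev_model {
          unsigned short type;
          void *ml_priv;  /* CAN mid-layer private data; may be unset */
      };

      /* Accept a frame only if the device is CAN and ml_priv is initialized. */
      static bool can_rx_allowed(const struct netdev_model *dev)
      {
          return dev->type == ARPHRD_CAN && dev->ml_priv != NULL;
      }
      ```

      A bonding or tun device that reports ARPHRD_CAN but leaves ml_priv unset
      fails the second half of the test, so its frames are dropped instead of
      dereferencing a NULL pointer.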
      
      Fixes: 4e096a18 ("net: introduce CAN specific pointer in the struct net_device")
      Reported-by: syzbot+2d7f58292cb5b29eb5ad@syzkaller.appspotmail.com
      Reported-by: Wei Chen <harperchen1110@gmail.com>
      Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
      Link: https://lore.kernel.org/all/20221206201259.3028-1-socketcan@hartkopp.net
      Cc: stable@vger.kernel.org
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
      Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
      Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
    • RDMA/core: Refactor rdma_bind_addr · e51e93cb
      Committed by Patrisious Haddad
      mainline inclusion
      from mainline-v6.3-rc1
      commit 8d037973
      category: bugfix
      bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6X49E
      CVE: CVE-2023-2176
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8d037973d48c026224ab285e6a06985ccac6f7bf
      
      ---------------------------
      
      Refactor rdma_bind_addr function so that it doesn't require that the
      cma destination address be changed before calling it.
      
      So now it will update the destination address internally only when it is
      really needed and after passing all the required checks.
      
      This in turn results in cleaner and more sensible call and error
      handling flows for the functions that call it directly or indirectly.

      Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
      Reported-by: Wei Chen <harperchen1110@gmail.com>
      Reviewed-by: Mark Zhang <markzhang@nvidia.com>
      Link: https://lore.kernel.org/r/3d0e9a2fd62bc10ba02fed1c7c48a48638952320.1672819273.git.leonro@nvidia.com
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      (cherry picked from commit 8d037973)
      Signed-off-by: Liu Jian <liujian56@huawei.com>
      Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
      Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
    • RDMA/cma: Ensure rdma_addr_cancel() happens before issuing more requests · 7767949c
      Committed by Jason Gunthorpe
      mainline inclusion
      from mainline-v5.15-rc4
      commit 305d568b
      category: bugfix
      bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6X49E
      CVE: CVE-2023-2176
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=305d568b72f17f674155a2a8275f865f207b3808
      
      ---------------------------
      
      The FSM can run in a circle allowing rdma_resolve_ip() to be called twice
      on the same id_priv. While this cannot happen without going through the
      work, it violates the invariant that the same address resolution
      background request cannot be active twice.
      
             CPU 1                                  CPU 2

      rdma_resolve_addr():
        RDMA_CM_IDLE -> RDMA_CM_ADDR_QUERY
        rdma_resolve_ip(addr_handler)  #1

                                     process_one_req(): for #1
                                      addr_handler():
                                        RDMA_CM_ADDR_QUERY -> RDMA_CM_ADDR_BOUND
                                        mutex_unlock(&id_priv->handler_mutex);
                                        [.. handler still running ..]

      rdma_resolve_addr():
        RDMA_CM_ADDR_BOUND -> RDMA_CM_ADDR_QUERY
        rdma_resolve_ip(addr_handler)
          !! two requests are now on the req_list

      rdma_destroy_id():
       destroy_id_handler_unlock():
        _destroy_id():
         cma_cancel_operation():
          rdma_addr_cancel()

                                      // process_one_req() self removes it
                                      spin_lock_bh(&lock);
                                       cancel_delayed_work(&req->work);
                                       if (!list_empty(&req->list)) == true

            ! rdma_addr_cancel() returns after process_one_req #1 is done

         kfree(id_priv)

                                     process_one_req(): for #2
                                      addr_handler():
                                        mutex_lock(&id_priv->handler_mutex);
                                        !! Use after free on id_priv
      
      rdma_addr_cancel() expects there to be one req on the list and only
      cancels the first one. The self-removal behavior of the work only happens
      after the handler has returned. This yields a situation where the
      req_list can have two reqs for the same "handle" but rdma_addr_cancel()
      only cancels the first one.
      
      The second req remains active beyond rdma_destroy_id() and will
      use-after-free id_priv once it inevitably triggers.
      
      Fix this by remembering if the id_priv has called rdma_resolve_ip() and
      always cancel before calling it again. This ensures the req_list never
      gets more than one item in it and doesn't cost anything in the normal flow
      that never uses this strange error path.
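      The "remember, then cancel before re-issuing" idea can be sketched as a
      small state model. All names below (struct id_model, addr_cancel_model(),
      resolve_addr_model()) are hypothetical, and the pending counter merely
      stands in for entries on the req_list.

      ```c
      /* Hypothetical model of an id with at most one outstanding request. */
      struct id_model {
          int used_resolve;  /* set once a resolve request has ever been issued */
          int pending;       /* requests currently on the modelled req_list */
      };

      /* Like the cancel described above, this removes a single request. */
      static void addr_cancel_model(struct id_model *id)
      {
          if (id->pending > 0)
              id->pending--;
      }

      /* Issue a resolve request, cancelling any earlier one first. */
      static void resolve_addr_model(struct id_model *id)
      {
          if (id->used_resolve)
              addr_cancel_model(id);  /* keep the list at one entry */
          else
              id->used_resolve = 1;   /* first use: nothing to cancel */
          id->pending++;
      }
      ```

      Because every re-issue is preceded by a cancel, the modelled list never
      holds more than one request, so cancelling one entry at destroy time is
      enough.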
      
      Link: https://lore.kernel.org/r/0-v1-3bc675b8006d+22-syz_cancel_uaf_jgg@nvidia.com
      Cc: stable@vger.kernel.org
      Fixes: e51060f0 ("IB: IP address based RDMA connection manager")
      Reported-by: syzbot+dc3dfba010d7671e05f5@syzkaller.appspotmail.com
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
      (cherry picked from commit 305d568b)
      Signed-off-by: Liu Jian <liujian56@huawei.com>

      Conflicts:
      	drivers/infiniband/core/cma_priv.h
      Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
      Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
    • scsi: dpt_i2o: Remove obsolete driver · 49d0287b
      Committed by Arnd Bergmann
      mainline inclusion
      from mainline-v6.0-rc1~14
      commit b04e75a4
      category: bugfix
      bugzilla: 188707, https://gitee.com/src-openeuler/kernel/issues/I6VK2F
      CVE: CVE-2023-2007
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b04e75a4a8a81887386a0d2dbf605a48e779d2a0
      
      ----------------------------------------
      
      The dpt_i2o driver was fixed to stop using virt_to_bus() in 2008, but it
      still has a stale reference in an error handling code path that could never
      work. I submitted a patch to fix this reference earlier, but Hannes
      Reinecke suggested that removing the driver may be just as good here.
      
      The i2o driver layer was removed in 2015 with commit 4a72a7af
      ("staging: remove i2o subsystem"), but the even older dpt_i2o scsi driver
      stayed around.
      
      The last non-cleanup patches I could find were from Miquel van Smoorenburg
      and Mark Salyzyn back in 2008; they might know if there is any chance of
      the hardware still being used anywhere.
      
      Link: https://lore.kernel.org/linux-scsi/CAK8P3a1XfwkTOV7qOs1fTxf4vthNBRXKNu8A5V7TWnHT081NGA@mail.gmail.com/T/
      Link: https://lore.kernel.org/r/20220624155226.2889613-3-arnd@kernel.org
      Cc: Miquel van Smoorenburg <mikevs@xs4all.net>
      Cc: Mark Salyzyn <salyzyn@android.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
      Reviewed-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Hou Tao <houtao1@huawei.com>
      Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
    • writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs · 79c2ed51
      Committed by Baokun Li
      mainline inclusion
      from mainline-v6.3-rc8
      commit 1ba1199e
      category: bugfix
      bugzilla: 188601, https://gitee.com/openeuler/kernel/issues/I6TNTC
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1ba1199ec5747f475538c0d25a32804e5ba1dfde
      
      --------------------------------
      
      KASAN reports a null-ptr-deref:
      ==================================================================
      BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0
      Write of size 8 at addr 0000000000000000 by task sync/943
      CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461
      Call Trace:
       <TASK>
       dump_stack_lvl+0x7f/0xc0
       print_report+0x2ba/0x340
       kasan_report+0xc4/0x120
       kasan_check_range+0x1b7/0x2e0
       __kasan_check_write+0x24/0x40
       bdi_split_work_to_wbs+0x5c5/0x7b0
       sync_inodes_sb+0x195/0x630
       sync_inodes_one_sb+0x3a/0x50
       iterate_supers+0x106/0x1b0
       ksys_sync+0x98/0x160
      [...]
      ==================================================================
      
      The race that causes the above issue is as follows:
      
                 cpu1                     cpu2
      -------------------------|-------------------------
      inode_switch_wbs
       INIT_WORK(&isw->work, inode_switch_wbs_work_fn)
       queue_rcu_work(isw_wq, &isw->work)
       // queue_work async
        inode_switch_wbs_work_fn
         wb_put_many(old_wb, nr_switched)
          percpu_ref_put_many
           ref->data->release(ref)
           cgwb_release
            queue_work(cgwb_release_wq, &wb->release_work)
            // queue_work async
             &wb->release_work
             cgwb_release_workfn
                                  ksys_sync
                                   iterate_supers
                                    sync_inodes_one_sb
                                     sync_inodes_sb
                                      bdi_split_work_to_wbs
                                       kmalloc(sizeof(*work), GFP_ATOMIC)
                                       // alloc memory failed
              percpu_ref_exit
               ref->data = NULL
               kfree(data)
                                       wb_get(wb)
                                        percpu_ref_get(&wb->refcnt)
                                         percpu_ref_get_many(ref, 1)
                                          atomic_long_add(nr, &ref->data->count)
                                           atomic64_add(i, v)
                                           // trigger null-ptr-deref
      
      bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all
      wbs.  If the allocation of new work fails, the on-stack fallback will be
      used and the reference count of the current wb is increased afterwards.
      If cgroup writeback membership switches occur before getting the reference
      count and the current wb is released as old_wb, then calling wb_get() or
      wb_put() will trigger the null pointer dereference above.
      
      This issue was introduced in v4.3-rc7 (see fix tag1).  Both
      sync_inodes_sb() and __writeback_inodes_sb_nr() calls to
      bdi_split_work_to_wbs() can trigger this issue.  For scenarios called via
      sync_inodes_sb(), originally commit 7fc5854f ("writeback: synchronize
      sync(2) against cgroup writeback membership switches") reduced the
      possibility of the issue by adding wb_switch_rwsem, but a change in
      v5.14-rc1 (see fix tag2) removed the "inode_io_list_del_locked(inode, old_wb)" from
      inode_switch_wbs_work_fn() so that wb->state contains WB_has_dirty_io,
      thus old_wb is not skipped when traversing wbs in bdi_split_work_to_wbs(),
      and the issue becomes easily reproducible again.
      
      To solve this problem, percpu_ref_exit() is called under RCU protection to
      avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs().
      Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(),
      and skip the current wb if wb_tryget() fails because the wb has already
      been shut down.
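      The "try to get, otherwise skip" pattern can be shown with a simplified
      user-space model. struct wb_model, wb_tryget_model() and
      split_work_model() are hypothetical; a plain flag stands in for a killed
      percpu refcount, and no RCU or locking is modelled.

      ```c
      #include <stdbool.h>
      #include <stddef.h>

      /* Hypothetical model of a writeback context with a killable refcount. */
      struct wb_model {
          long refs;   /* outstanding references */
          bool dying;  /* set once the refcount has been killed */
      };

      /* Take a reference only if the object is still alive. */
      static bool wb_tryget_model(struct wb_model *wb)
      {
          if (wb->dying)
              return false;
          wb->refs++;
          return true;
      }

      /* Walk all wbs and count how many received work; dying ones are skipped. */
      static size_t split_work_model(struct wb_model *wbs, size_t n)
      {
          size_t queued = 0;

          for (size_t i = 0; i < n; i++) {
              if (!wb_tryget_model(&wbs[i]))
                  continue;    /* wb already shut down: skip it */
              queued++;        /* stand-in for queueing the work item */
              wbs[i].refs--;   /* stand-in for the put once work is done */
          }
          return queued;
      }
      ```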
      
      Link: https://lkml.kernel.org/r/20230410130826.1492525-1-libaokun1@huawei.com
      Fixes: b817525a ("writeback: bdi_writeback iteration must not skip dying ones")
      Signed-off-by: Baokun Li <libaokun1@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Hou Tao <houtao1@huawei.com>
      Cc: yangerkun <yangerkun@huawei.com>
      Cc: Zhang Yi <yi.zhang@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

      Conflicts:
      	mm/backing-dev.c
      Signed-off-by: Baokun Li <libaokun1@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Reviewed-by: Yang Erkun <yangerkun@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
    • bpf, sockmap: Fix an infinite loop error when len is 0 in tcp_bpf_recvmsg_parser() · 4be06acd
      Committed by Liu Jian
      mainline inclusion
      from mainline-v6.3-rc2
      commit d900f3d2
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I65HYE
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d900f3d20cc3169ce42ec72acc850e662a4d4db2
      
      ---------------------------
      
      When the buffer length of the recvmsg system call is 0, we get the
      following soft lockup:
      
      watchdog: BUG: soft lockup - CPU#3 stuck for 27s! [a.out:6149]
      CPU: 3 PID: 6149 Comm: a.out Kdump: loaded Not tainted 6.2.0+ #30
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
      RIP: 0010:remove_wait_queue+0xb/0xc0
      Code: 5e 41 5f c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 57 <41> 56 41 55 41 54 55 48 89 fd 53 48 89 f3 4c 8d 6b 18 4c 8d 73 20
      RSP: 0018:ffff88811b5978b8 EFLAGS: 00000246
      RAX: 0000000000000000 RBX: ffff88811a7d3780 RCX: ffffffffb7a4d768
      RDX: dffffc0000000000 RSI: ffff88811b597908 RDI: ffff888115408040
      RBP: 1ffff110236b2f1b R08: 0000000000000000 R09: ffff88811a7d37e7
      R10: ffffed10234fa6fc R11: 0000000000000001 R12: ffff88811179b800
      R13: 0000000000000001 R14: ffff88811a7d38a8 R15: ffff88811a7d37e0
      FS:  00007f6fb5398740(0000) GS:ffff888237180000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020000000 CR3: 000000010b6ba002 CR4: 0000000000370ee0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       tcp_msg_wait_data+0x279/0x2f0
       tcp_bpf_recvmsg_parser+0x3c6/0x490
       inet_recvmsg+0x280/0x290
       sock_recvmsg+0xfc/0x120
       ____sys_recvmsg+0x160/0x3d0
       ___sys_recvmsg+0xf0/0x180
       __sys_recvmsg+0xea/0x1a0
       do_syscall_64+0x3f/0x90
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      The logic in tcp_bpf_recvmsg_parser is as follows:
      
      msg_bytes_ready:
      	copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
      	if (!copied) {
      		wait data;
      		goto msg_bytes_ready;
      	}
      
      In this case, "copied" is always 0, so an infinite loop occurs.
      
      According to the Linux system call man page, 0 should be returned in this
      case. Therefore, in tcp_bpf_recvmsg_parser(), if the length is 0, directly
      return. Also modify several other functions with the same problem.
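      The effect of the early return can be reproduced with a bounded
      user-space model of the loop. recv_model() and its parameters are
      hypothetical; the spin cap exists only so that the model terminates, and
      the wait step is assumed to return at once because data is already queued.

      ```c
      #include <stddef.h>

      /*
       * Model of the receive loop. Returns the bytes copied and stores the
       * number of loop iterations in *spins (capped at max_spins).
       */
      static size_t recv_model(size_t queued, size_t len, int early_return,
                               unsigned int *spins, unsigned int max_spins)
      {
          *spins = 0;
          if (early_return && len == 0)
              return 0;                /* the fix: nothing to copy, return 0 */

          while (*spins < max_spins) {
              size_t copied = len < queued ? len : queued;

              (*spins)++;
              if (copied)
                  return copied;
              /* "wait for data" returns immediately: data is already queued */
          }
          return 0;                    /* cap reached: stands in for the lockup */
      }
      ```

      With len == 0 and no early return, the model spins until the cap is hit;
      with the early return it exits before entering the loop.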
      
      Fixes: 1f5be6b3 ("udp: Implement udp_bpf_recvmsg() for sockmap")
      Fixes: 9825d866 ("af_unix: Implement unix_dgram_bpf_recvmsg()")
      Fixes: c5d2177a ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: Liu Jian <liujian56@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Cc: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230303080946.1146638-1-liujian56@huawei.com
      Signed-off-by: Liu Jian <liujian56@huawei.com>

      Conflicts:
      	net/ipv4/udp_bpf.c
      	net/unix/unix_bpf.c
      Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
    • bpf, sockmap: Fix double bpf_prog_put on error case in map_link · 220b8487
      Committed by John Fastabend
      mainline inclusion
      from mainline-v5.17-rc1
      commit 218d747a
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I65HYE
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=218d747a4142f281a256687bb513a135c905867b
      
      ---------------------------
      
      sock_map_link() is called to update a sockmap entry with a sk. But, if the
      sock_map_init_proto() call fails then we return an error to the map_update
      op against the sockmap. In the error path though we need to cleanup psock
      and dec the refcnt on any programs associated with the map, because we
      refcnt them early in the update process to ensure they are pinned for the
      psock. (This avoids a race where user deletes programs while also updating
      the map with new socks.)
      
      In current code we do the prog refcnt dec explicitly by calling
      bpf_prog_put() when the program was found in the map. But, after commit
      '38207a5e' in this error path we've already done the prog to psock
      assignment so the programs have a reference from the psock as well. This
      then causes the psock tear down logic, invoked by sk_psock_put() in the
      error path, to similarly call bpf_prog_put on the programs there.
      
      To be explicit this logic does the prog->psock assignment:
      
        if (msg_*)
          psock_set_prog(...)
      
      Then the error path under the out_progs label does a similar check and
      dec with:
      
        if (msg_*)
           bpf_prog_put(...)
      
      And the teardown logic sk_psock_put() does ...
      
        psock_set_prog(msg_*, NULL)
      
      ... triggering another bpf_prog_put(...). Then KASAN gives us this splat,
      found by syzbot, because we've created an imbalance between bpf_prog_inc
      and bpf_prog_put by calling put twice on the program.
      
        BUG: KASAN: vmalloc-out-of-bounds in __bpf_prog_put kernel/bpf/syscall.c:1812 [inline]
        BUG: KASAN: vmalloc-out-of-bounds in __bpf_prog_put kernel/bpf/syscall.c:1812 [inline] kernel/bpf/syscall.c:1829
        BUG: KASAN: vmalloc-out-of-bounds in bpf_prog_put+0x8c/0x4f0 kernel/bpf/syscall.c:1829 kernel/bpf/syscall.c:1829
        Read of size 8 at addr ffffc90000e76038 by task syz-executor020/3641
      
      To fix this, clean up the error path so that it does not call
      bpf_prog_put() once the progs are assigned to the psock. It then relies on
      the normal psock teardown logic to do the complete cleanup.
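The double put described above can be reproduced with a toy reference counter. This is only a minimal sketch under stated assumptions: `toy_prog`, `toy_psock` and `toy_failed_update()` are hypothetical stand-ins invented for illustration, not the kernel's `struct bpf_prog`, `struct sk_psock` or the real sock_map update path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted BPF program. */
struct toy_prog {
	int refcnt;
};

static void toy_prog_inc(struct toy_prog *p)
{
	p->refcnt++;
}

static void toy_prog_put(struct toy_prog *p)
{
	assert(p->refcnt > 0); /* a put below zero would be a use-after-free */
	p->refcnt--;
}

/* Hypothetical psock that owns at most one program reference. */
struct toy_psock {
	struct toy_prog *msg_parser;
};

/* Teardown drops whatever reference the psock still owns. */
static void toy_psock_put(struct toy_psock *ps)
{
	if (ps->msg_parser) {
		toy_prog_put(ps->msg_parser);
		ps->msg_parser = NULL;
	}
}

/*
 * Simulate an update whose init step fails after the program was
 * already assigned to the psock. With explicit_put != 0 the error
 * path also drops the reference itself, as the old code did.
 * Returns the refcount left on the program afterwards.
 */
static int toy_failed_update(struct toy_prog *p, int explicit_put)
{
	struct toy_psock ps = { NULL };

	toy_prog_inc(p);     /* pin the program early in the update */
	ps.msg_parser = p;   /* the psock now owns that reference */

	/* ... the init step fails here ... */
	if (explicit_put)
		toy_prog_put(p); /* old error path: one put too many */
	toy_psock_put(&ps);  /* teardown puts the psock's reference */

	return p->refcnt;
}
```

Starting from a program whose only reference is held by the map (refcnt 1), the old error path ends at 0 and so steals the map's reference, while the fixed path ends at 1.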
      
      For completeness we also cover the case where sk_psock_init_strp() fails,
      but this is not expected because it indicates an incorrect socket type
      and should be caught earlier.
      
      Fixes: 38207a5e ("bpf, sockmap: Attach map progs to psock early for feature probes")
      Reported-by: syzbot+bb73e71cf4b8fd376a4f@syzkaller.appspotmail.com
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220104214645.290900-1-john.fastabend@gmail.com
      (cherry picked from commit 218d747a)
      Signed-off-by: Liu Jian <liujian56@huawei.com>
      
      Conflicts:
      	net/core/sock_map.c
      Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
      220b8487
    • J
      bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap · cbc0f0ca
      John Fastabend committed
      mainline inclusion
      from mainline-v5.16-rc5
      commit c0d95d33
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I65HYE
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c0d95d3380ee099d735e08618c0d599e72f6c8b0
      
      ---------------------------
      
      When a sock is added to a sock map we evaluate what proto op hooks need to
      be used. However, when the program is removed from the sock map we have not
      been evaluating if that changes the required program layout.
      
      Before the patch listed in the 'fixes' tag this was not causing failures
      because the base program set handles all cases. Specifically, the case with
      a stream parser and the case without a stream parser are both handled. With
      the fix below we identified a race when running with a proto op that attempts
      to read skbs off both the stream parser and the skb->receive_queue. Namely,
      when the stream parser is empty, checking the skb->receive_queue from
      recvmsg at the precise moment when the parser is paused and the
      receive_queue is not empty could result in skipping the stream parser. This
      may break an RX policy that depends on the parser to run.
      
      The fix tag then loads a specific proto ops that resolved this race. But we
      missed removing that proto ops recv hook when the sock is removed from the
      sockmap. The result is that the stream parser is stopped, so no more skbs
      will be aggregated there, but the hook and BPF program continue to be
      attached on the psock. User space will then get an EBUSY when trying to read
      the socket because the recvmsg() handler is now waiting on a stopped stream
      parser.
      
      To fix this we rerun the proto ops init() function, which looks at the new
      set of progs attached to the psock and resets the proto ops hook to the
      correct handlers. In the above case, where we remove the sock from the sock
      map, the RX prog is no longer listed, so the proto ops is removed.
      
      Fixes: c5d2177a ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20211119181418.353932-3-john.fastabend@gmail.com
      (cherry picked from commit c0d95d33)
      Signed-off-by: Liu Jian <liujian56@huawei.com>
      
      Conflicts:
      	net/core/skmsg.c
      Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
      cbc0f0ca
    • J
      bpf, sockmap: Attach map progs to psock early for feature probes · 75922849
      John Fastabend committed
      mainline inclusion
      from mainline-v5.16-rc5
      commit 38207a5e
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I65HYE
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=38207a5e81230d6ffbdd51e5fa5681be5116dcae
      
      ---------------------------
      
      When a TCP socket is added to a sock map we look at the programs attached
      to the map to determine what proto op hooks need to be changed. Before
      the patch in the 'fixes' tag there were only two categories -- the empty
      set of programs or a TX policy. In any case the base set handled the
      receive case.
      
      After the fix we have an optimized program for receive that closes a small,
      but possible, race on receive. This program is loaded only when the map the
      psock is being added to includes a RX policy. Otherwise, the race is not
      possible so we don't need to handle the race condition.
      
      In order for the call to sk_psock_init() to correctly evaluate the above
      conditions all progs need to be set in the psock before the call. However,
      in the current code this is not the case. We end up evaluating the
      requirements on the old prog state. If your psock is attached to multiple
      maps -- for example a tx map and rx map -- then the second update would pull
      in the correct maps. But in the other pattern, with a single rx enabled
      map, the correct receive hooks are not used. The result is that the race
      fixed by the patch in the fixes tag below may still be seen in this case.
      
      To fix this we simply set all psock->progs before doing the call into
      sock_map_init(). With this the init() call gets the full list of programs
      and chooses the correct proto ops on the first iteration instead of
      requiring the second update to pull them in. This fixes the race case when
      only a single map is used.
      
      Fixes: c5d2177a ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20211119181418.353932-2-john.fastabend@gmail.com
      (cherry picked from commit 38207a5e)
      Signed-off-by: Liu Jian <liujian56@huawei.com>
      
      Conflicts:
      	net/core/sock_map.c
      Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
      75922849
    • J
      bpf, sockmap: Fix return codes from tcp_bpf_recvmsg_parser() · d3bb682c
      John Fastabend committed
      mainline inclusion
      from mainline-v5.17-rc1
      commit 5b2c5540
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I65HYE
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5b2c5540b8110eea0d67a78fb0ddb9654c58daeb
      
      ---------------------------
      
      Applications can be slightly confused because we do not always return the
      same error code as expected, e.g. what the TCP stack normally returns. For
      example, on a sock error (sk->sk_err set) we return EAGAIN instead of
      returning the sock_error. This usually means the application will 'try
      again' instead of aborting immediately. As another example, when a shutdown
      event is received we should immediately abort instead of waiting for data
      when the user provides a timeout.
      
      These tend not to be fatal, and applications usually recover, but they
      introduce bogus errors to the user or unexpected latency. Before
      'c5d2177a' we fell back to the TCP stack when no data was available
      so we managed to catch many of the cases here, although with the extra
      latency cost of calling tcp_msg_wait_data() first.
      
      To fix this, let's duplicate the error handling in the TCP stack into
      tcp_bpf so that we get the same error codes.
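The kind of decision being duplicated can be sketched as a small function that picks a return code when no data is queued. This is a hedged illustration only: `toy_sock`, `TOY_WAIT` and `toy_recv_no_data()` are hypothetical names, the fields merely model the socket state mentioned above, and the order of checks is an assumption loosely based on how a TCP receive path commonly behaves rather than a copy of the kernel code.

```c
#include <errno.h>

/* Hypothetical model of the socket state consulted when no data is queued. */
struct toy_sock {
	int done;         /* models a socket already marked done */
	int err;          /* models sk->sk_err: pending positive errno, or 0 */
	int rcv_shutdown; /* models a received shutdown event */
	int closed;       /* models a closed connection */
	long timeo;       /* receive timeout; 0 means non-blocking */
};

/* Positive sentinel: the caller should block waiting for data. */
enum { TOY_WAIT = 1 };

/*
 * Decide what a recvmsg-style handler returns when its queue is empty.
 * A pending socket error is reported (and cleared) instead of -EAGAIN,
 * and a shutdown returns 0 immediately instead of waiting out the timeout.
 */
static int toy_recv_no_data(struct toy_sock *sk)
{
	if (sk->done)
		return 0;
	if (sk->err) {
		int e = -sk->err;

		sk->err = 0; /* reading the error consumes it */
		return e;
	}
	if (sk->rcv_shutdown)
		return 0;
	if (sk->closed)
		return -ENOTCONN;
	if (!sk->timeo)
		return -EAGAIN;
	return TOY_WAIT;
}
```

With these checks a socket carrying a pending `ECONNRESET` reports `-ECONNRESET` rather than `-EAGAIN`, so an application aborts instead of retrying.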
      
      These were found in our CI tests that run applications against sockmap
      and do longer lived testing, at least compared to test_sockmap that
      does short-lived ping/pong tests, and in some of our test clusters
      we deploy.
      
      It's non-trivial to do these in shorter-form CI tests that would be
      appropriate for BPF selftests, but we are looking into it so we can
      ensure this keeps working going forward. As a preview, one idea is to
      pull in the packetdrill testing, which catches some of this.
      
      Fixes: c5d2177a ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220104205918.286416-1-john.fastabend@gmail.com
      (cherry picked from commit 5b2c5540)
      Signed-off-by: Liu Jian <liujian56@huawei.com>
      
      Conflicts:
      	net/ipv4/tcp_bpf.c
      Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
      d3bb682c
    • J
      bpf, sockmap: Fix race in ingress receive verdict with redirect to self · 9a9749fb
      John Fastabend committed
      mainline inclusion
      from mainline-v5.16-rc1
      commit c5d2177a
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I65HYE
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c5d2177a72a1659554922728fc407f59950aa929
      
      ---------------------------
      
      A socket in a sockmap may have different combinations of programs attached
      depending on configuration. There can be no programs, in which case the
      socket acts as a sink only. There can be a TX program only, in which case a
      BPF program is attached to the sending side but no RX program is attached.
      There can be an RX program only, where sends have no BPF program attached
      but receives are hooked with BPF. And finally, both TX and RX programs may
      be attached. This gives us the permutations:
      
       None, Tx, Rx, and TxRx
      
      To date most of our use cases have been either the TX case, used as a fast
      datapath to directly copy between a local application and a userspace
      proxy, or the Rx and TxRx cases, for applications that operate an in-kernel
      proxy. The traffic in the first case, where we hook applications into a
      userspace application, looks like this:
      
        AppA  redirect   AppB
         Tx <-----------> Rx
         |                |
         +                +
         TCP <--> lo <--> TCP
      
      In this case all traffic from AppA (after the 3-way handshake) is copied
      into the AppB ingress queue and no traffic is ever on the TCP receive_queue.
      
      In the second case the application never receives, except in some rare error
      cases, traffic on the actual user space socket. Instead the send happens in
      the kernel.
      
                 AppProxy       socket pool
             sk0 ------------->{sk1,sk2, skn}
              ^                      |
              |                      |
              |                      v
             ingress              lb egress
             TCP                  TCP
      
      Here, because traffic is never read off the socket with userspace recv() APIs,
      there is only ever one reader on the sk receive_queue, namely the BPF programs.
      
      However, we've started to introduce a third configuration where the BPF program
      on receive should process the data, but then the normal case is to push the
      data into the receive queue of AppB.
      
             AppB
             recv()                (userspace)
           -----------------------
             tcp_bpf_recvmsg()     (kernel)
               |             |
               |             |
               |             |
             ingress_msgQ    |
               |             |
             RX_BPF          |
               |             |
               v             v
             sk->receive_queue
      
      This is different from the App{A,B} redirect because traffic is first received
      on the sk->receive_queue.
      
      Now for the issue. The tcp_bpf_recvmsg() handler first checks the ingress_msg
      queue for any data handled by the BPF rx program and returned with PASS code
      so that it was enqueued on the ingress msg queue. Then if no data exists on
      that queue it checks the socket receive queue. Unfortunately, this is the same
      receive_queue the BPF program is reading data off of. So we get a race. It's
      possible for the recvmsg() hook to pull data off the receive_queue before the
      BPF hook has a chance to read it. It typically happens when an application is
      banging on recv() and getting EAGAINs until it manages to race with the RX
      BPF program.
      
      To fix this we note that before this patch at attach time when the socket is
      loaded into the map we check if it needs a TX program or just the base set of
      proto bpf hooks. Then it uses the above general RX hook regardless of if we
      have a BPF program attached at rx or not. This patch now extends this check to
      handle all cases enumerated above, TX, RX, TXRX, and none. And to fix above
      race when an RX program is attached we use a new hook that is nearly identical
      to the old one except now we do not let the recv() call skip the RX BPF program.
      Now only the BPF program pulls data from sk->receive_queue and recv() only
      pulls data from the ingress msgQ post BPF program handling.
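The selection described above, one set of hooks per combination of attached programs, can be sketched as follows. The names `toy_cfg`, `toy_progs`, `toy_pick_cfg()` and `toy_recv_may_read_receive_queue()` are hypothetical and chosen only for illustration; this is not the kernel's tcp_bpf code, just a model of the rule that recv() must not read sk->receive_queue once an RX program is attached.

```c
/* Hypothetical labels for the four permutations: None, Tx, Rx, TxRx. */
enum toy_cfg {
	TOY_BASE, /* no programs attached */
	TOY_TX,   /* TX program only */
	TOY_RX,   /* RX program only */
	TOY_TXRX, /* both TX and RX programs */
};

/* Hypothetical summary of which programs are attached to the psock. */
struct toy_progs {
	int has_tx;
	int has_rx;
};

/* Pick the hook set from the full list of attached programs. */
static enum toy_cfg toy_pick_cfg(const struct toy_progs *p)
{
	if (p->has_tx && p->has_rx)
		return TOY_TXRX;
	if (p->has_rx)
		return TOY_RX;
	if (p->has_tx)
		return TOY_TX;
	return TOY_BASE;
}

/*
 * With an RX program attached, only the BPF program may pull data from
 * the socket receive queue; recv() reads the ingress msg queue only.
 * Without one, the general hook may still fall back to the receive queue.
 */
static int toy_recv_may_read_receive_queue(enum toy_cfg cfg)
{
	return cfg == TOY_BASE || cfg == TOY_TX;
}
```

Keeping the choice a pure function of the attached programs means it can be re-evaluated whenever that set changes.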
      
      With this resolved our AppB from above has been up and running for many hours
      without detecting any errors. We do this by correlating counters in RX BPF
      events and the AppB to ensure data is never skipping the BPF program. Selftests
      were not able to detect this because we only run them for a short period of time
      on well ordered send/recvs so we don't get any of the noise we see in real
      application environments.
      
      Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Jussi Maki <joamaki@gmail.com>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com
      (cherry picked from commit c5d2177a)
      Signed-off-by: Liu Jian <liujian56@huawei.com>
      
      Conflicts:
      	net/ipv4/tcp_bpf.c
      Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
      9a9749fb
  2. 26 Apr, 2023 22 commits
  3. 19 Apr, 2023 4 commits