1. 12 Jun, 2023 (3 commits)
  2. 09 Jun, 2023 (2 commits)
    • W
      sched: smart grid: init sched_grid_qos structure on QOS purpose · ce35ded5
      Authored by Wang ShaoBo
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7BQZ0
      CVE: NA
      
      ----------------------------------------
      
      As smart grid scheduling (SGS) may shrink resources and affect task QOS,
      we provide methods for evaluating task QOS in each divided grid. We mainly
      focus on the following two aspects:

         1. Evaluate whether resources (such as CPU or memory) meet our demand
         2. Minimize the impact on the governors (cpufreq and cpuidle) we work with

      To tackle these questions, we have summarized several sampling methods
      that obtain tasks' characteristics while reducing scheduling noise
      as much as possible:
      
        1. We detect the key factors that determine how sensitive a process is
           to cpufreq or cpuidle adjustment, and use them to guide the
           cpufreq/cpuidle governor
        2. We dynamically monitor process memory bandwidth and adjust memory
           allocation to minimize cross-node remote memory access
        3. We provide a variety of load tracking mechanisms to adapt to
           different patterns of task load change
      
           ---------------------------------     -----------------
          |            class A              |   |     class B     |
          |    --------        --------     |   |     --------    |
          |   | group0 |      | group1 |    |---|    | group2 |   |----------+
          |    --------        --------     |   |     --------    |          |
          |    CPU/memory sensitive type    |   |   balance type  |          |
           ----------------+----------------     --------+--------           |
                           v                             v                   | (target cpufreq)
           -------------------------------------------------------           | (sensitivity)
          |              Not satisfied with QOS?                  |          |
           --------------------------+----------------------------           |
                                     v                                       v
           -------------------------------------------------------     ----------------
          |              expand or shrink resource                |<--|  energy model  |
           ----------------------------+--------------------------     ----------------
                                       v                                     |
           -----------          -----------          ------------            v
          |           |        |           |        |            |     ---------------
          |   GRID0   +--------+   GRID1   +--------+   GRID2    |<-- |   governor    |
          |           |        |           |        |            |     ---------------
           -----------          -----------          ------------
                         \            |            /
                          \  -------------------  /
                            |  pages migration  |
                             -------------------
      
      We will introduce the energy model in a follow-up implementation, and will
      consider dynamic affinity adjustment between the divided grids at runtime.
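
      The per-grid QOS evaluation described above can be sketched as a small
      userspace model. This is purely illustrative: the struct and function
      names below are hypothetical and do not reflect the actual
      sched_grid_qos layout in the kernel.

      ```c
      /* Hypothetical model of per-grid QOS bookkeeping; field and function
       * names are illustrative only, not the real sched_grid_qos layout. */
      #include <assert.h>
      #include <stdint.h>

      struct grid_qos_sample {
          uint64_t cpu_avail;   /* CPU capacity available inside the grid */
          uint64_t cpu_demand;  /* CPU capacity the grid's tasks require  */
      };

      /* Return 1 when the grid's resources satisfy the demand, 0 when the
       * grid should be expanded (the "Not satisfied with QOS?" box above). */
      static int grid_qos_satisfied(const struct grid_qos_sample *s)
      {
          return s->cpu_avail >= s->cpu_demand;
      }

      int main(void)
      {
          struct grid_qos_sample ok      = { .cpu_avail = 80, .cpu_demand = 60 };
          struct grid_qos_sample starved = { .cpu_avail = 40, .cpu_demand = 60 };

          assert(grid_qos_satisfied(&ok) == 1);       /* keep current grid  */
          assert(grid_qos_satisfied(&starved) == 0);  /* expand resources   */
          return 0;
      }
      ```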
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
      ce35ded5
    • H
      sched: Introduce smart grid scheduling strategy for cfs · 713cfd26
      Authored by Hui Tang
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7BQZ0
      CVE: NA
      
      ----------------------------------------
      
      We want to dynamically expand or shrink the affinity range of tasks
      based on the CPU topology level while meeting the minimum resource
      requirements of tasks.
      
      We divide several level of affinity domains according to sched domains:
      
      level4   * SOCKET  [                                                  ]
      level3   * DIE     [                             ]
      level2   * MC      [             ] [             ]
      level1   * SMT     [     ] [     ] [     ] [     ]
      level0   * CPU      0   1   2   3   4   5   6   7
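
      The five levels in the diagram above can be modeled as CPU bitmasks for
      this 8-CPU topology. This is an illustrative sketch (the function and
      enum names are hypothetical, not kernel code); in this single-socket
      drawing the DIE and SOCKET levels both span all 8 CPUs.

      ```c
      /* Sketch of the affinity levels above for an 8-CPU system, expressed
       * as CPU bitmasks; a task at level N may run on any set bit. */
      #include <assert.h>
      #include <stdint.h>

      enum grid_level { LV_CPU, LV_SMT, LV_MC, LV_DIE, LV_SOCKET };

      /* Width (in CPUs) of the affinity domain at each level, matching the
       * SMT=2, MC=4, DIE=8, SOCKET=8 topology drawn above. */
      static const unsigned level_width[] = { 1, 2, 4, 8, 8 };

      /* Affinity mask for `cpu` when its task is expanded to `level`. */
      static uint8_t affinity_mask(unsigned cpu, enum grid_level level)
      {
          unsigned w = level_width[level];
          unsigned base = (cpu / w) * w;        /* first CPU of the domain */
          return (uint8_t)(((1u << w) - 1) << base);
      }

      int main(void)
      {
          assert(affinity_mask(2, LV_CPU) == 0x04);     /* CPU 2 only */
          assert(affinity_mask(2, LV_SMT) == 0x0c);     /* CPUs 2-3   */
          assert(affinity_mask(2, LV_MC)  == 0x0f);     /* CPUs 0-3   */
          assert(affinity_mask(2, LV_SOCKET) == 0xff);  /* CPUs 0-7   */
          return 0;
      }
      ```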
      
      Whether users tend to choose power saving or performance affects the
      strategy for adjusting affinity. When the power saving mode is selected,
      we choose a more appropriate affinity based on the energy model to
      reduce power consumption, while considering the QOS of resources such
      as CPU and memory. For instance, if the current task's CPU load is less
      than required, smart grid judges, according to the energy model, whether
      to aggregate tasks into a smaller range or not.

      The main difference from EAS is that we pay more attention to the power
      consumption impact of mechanisms such as cpuidle and DVFS, and we
      classify tasks to reduce interference and ensure resource QOS in each
      divided unit, which is more suitable for general-purpose workloads on
      non-heterogeneous CPUs.
      
              --------        --------        --------
             | group0 |      | group1 |      | group2 |
              --------        --------        --------
      	   |                |              |
      	   v                |              v
             ---------------------+-----     -----------------
            |                  ---v--   |   |
            |       DIE0      |  MC1 |  |   |   DIE1
            |                  ------   |   |
             ---------------------------     -----------------
      
      We regularly count the resource satisfaction of groups and adjust the
      affinity accordingly; scheduling balance and memory migration are
      considered based on memory location to better meet resource requirements.
      Signed-off-by: Hui Tang <tanghui20@huawei.com>
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
      Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
      Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
      713cfd26
  3. 08 Jun, 2023 (25 commits)
  4. 07 Jun, 2023 (6 commits)
    • Z
      block: Fix the partition start may overflow in add_partition() · a325a269
      Authored by Zhong Jinghua
      hulk inclusion
      category: bugfix
      bugzilla: 187268, https://gitee.com/openeuler/kernel/issues/I76JDY
      CVE: NA
      
      ----------------------------------------
      
      Through block_ioctl, the unsigned number 0x8000000000000000 can be
      passed in as an input parameter, as shown below:
      
      block_ioctl
        blkdev_ioctl
          blkpg_ioctl
            blkpg_do_ioctl
              copy_from_user
              bdev_add_partition
                add_partition
                  p->start_sect = start; // start = 0x8000000000000000
      
      This then triggers a warning when submitting a bio:
      
      WARNING: CPU: 0 PID: 382 at fs/iomap/apply.c:54
      Call trace:
       iomap_apply+0x644/0x6e0
       __iomap_dio_rw+0x5cc/0xa24
       iomap_dio_rw+0x4c/0xcc
       ext4_dio_read_iter
       ext4_file_read_iter
       ext4_file_read_iter+0x318/0x39c
       call_read_iter
       lo_rw_aio.isra.0+0x748/0x75c
       do_req_filebacked+0x2d4/0x370
       loop_handle_cmd
       loop_queue_work+0x94/0x23c
       kthread_worker_fn+0x160/0x6bc
       loop_kthread_worker_fn+0x3c/0x50
       kthread+0x20c/0x25c
       ret_from_fork+0x10/0x18
      
      Stack:
      
      submit_bio_noacct
        submit_bio_checks
          blk_partition_remap
            bio->bi_iter.bi_sector += p->start_sect
            // bio->bi_iter.bi_sector = 0xffc0000000000000 + 65408
      ..
      loop_queue_work
       loop_handle_cmd
        do_req_filebacked
         pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset // pos < 0
         lo_rw_aio
           call_read_iter
            ext4_dio_read_iter
             __iomap_dio_rw
              iomap_apply
               ext4_iomap_begin
                 map.m_lblk = offset >> blkbits
                   ext4_set_iomap
                   iomap->offset = (u64) map->m_lblk << blkbits
                   // iomap->offset = 64512
               WARN_ON(iomap.offset > pos) // iomap.offset = 64512 and pos < 0
      
      A partition with start + length > disk->part0.nr_sects is invalid; a
      similar check already exists in blk_add_partition().
      Fix it by adding the same check to bdev_add_partition().
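
      The shape of such a range check can be sketched in userspace as follows.
      The names are illustrative, not the exact kernel code; the key detail is
      writing the comparison as `length > capacity - start` so the huge start
      value from the report cannot cause unsigned wraparound.

      ```c
      /* Illustrative partition range check: reject a partition whose start
       * or end sector falls outside the disk, without unsigned overflow. */
      #include <assert.h>
      #include <stdbool.h>
      #include <stdint.h>

      typedef uint64_t sector_t;

      static bool part_in_range(sector_t start, sector_t length, sector_t capacity)
      {
          if (start >= capacity)
              return false;                 /* start past end of disk */
          return length <= capacity - start; /* cannot overflow: start < capacity */
      }

      int main(void)
      {
          sector_t cap = 1 << 20;                      /* small test disk */
          assert(part_in_range(0, cap, cap));          /* whole disk: ok  */
          assert(!part_in_range(0x8000000000000000ULL, 128, cap)); /* the bug */
          assert(!part_in_range(cap - 1, 2, cap));     /* runs past the end */
          return 0;
      }
      ```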
      Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
      Reviewed-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
      a325a269
    • C
      block: refactor blkpg_ioctl · ad0628d3
      Authored by Christoph Hellwig
      mainline inclusion
      from mainline-v5.8-rc1
      commit fa9156ae
      category: bugfix
      bugzilla: 187268, https://gitee.com/openeuler/kernel/issues/I76JDY
      CVE: NA
      
      ----------------------------------------
      
      Split each sub-command out into a separate helper, and move those helpers
      to block/partitions/core.c instead of having a lot of partition
      manipulation logic open coded in block/ioctl.c.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      
      conflicts:
      	block/ioctl.c
      	block/partition-generic.c
      	include/linux/genhd.h
      Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
      Reviewed-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
      ad0628d3
    • Z
      nbd: get config_lock before sock_shutdown · 4b53fff3
      Authored by Zhong Jinghua
      hulk inclusion
      category: bugfix
      bugzilla: 188799, https://gitee.com/openeuler/kernel/issues/I79QWO
      CVE: NA
      
      ----------------------------------------
      
      Accessing config->socks in sock_shutdown may trigger a use-after-free
      (UAF). The reason is that sock_shutdown does not hold config_lock, so
      a concurrent nbd_ioctl can release config->socks in the meantime.
      
      T0: NBD_SET_SOCK
      T1: NBD_DO_IT
      
      T0						T1
      
      nbd_ioctl
        mutex_lock(&nbd->config_lock)
        // get lock
        __nbd_ioctl
      	nbd_start_device_ioctl
      	  nbd_start_device
      	  mutex_unlock(&nbd->config_lock)
      	  // release lock
      	  wait_event_interruptible
      	  (kill, enter sock_shutdown)
      	  sock_shutdown
      					nbd_ioctl
      					  mutex_lock(&nbd->config_lock)
      					  // get lock
      					  __nbd_ioctl
      					    nbd_add_socket
      					      krealloc
      						kfree(p)
      					        // old config->socks is freed
      	    nbd_sock *nsock = config->socks // use-after-free
      
      Fix it by acquiring config_lock before calling sock_shutdown.
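
      The fixed ordering can be modeled in userspace with a pthread mutex.
      This is a hedged sketch with hypothetical names, not the nbd driver
      code: the point is simply that config->socks is only walked while
      config_lock is held, so a concurrent ioctl cannot free it mid-walk.

      ```c
      /* Userspace model of the fix: shut sockets down only while holding
       * config_lock, mirroring "move config_lock up before sock_shutdown". */
      #include <assert.h>
      #include <pthread.h>
      #include <stdlib.h>

      struct config {
          pthread_mutex_t config_lock;
          int *socks;        /* stands in for the nbd config->socks array */
          int num_socks;
      };

      /* Walk the sock array; safe only because the caller holds config_lock. */
      static int sock_shutdown_locked(struct config *c)
      {
          int shut = 0;
          for (int i = 0; i < c->num_socks; i++)
              shut++;        /* stands in for kernel_sock_shutdown() */
          return shut;
      }

      int main(void)
      {
          struct config c = { PTHREAD_MUTEX_INITIALIZER, NULL, 0 };
          c.socks = calloc(4, sizeof(int));
          c.num_socks = 4;

          pthread_mutex_lock(&c.config_lock);   /* take the lock first...   */
          int shut = sock_shutdown_locked(&c);  /* ...then touch the socks  */
          pthread_mutex_unlock(&c.config_lock);

          assert(shut == 4);
          free(c.socks);
          return 0;
      }
      ```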
      Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
      Reviewed-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
      4b53fff3
    • D
      ipv6: sr: fix out-of-bounds read when setting HMAC data. · d9d1ff02
      Authored by David Lebrun
      stable inclusion
      from stable-v4.19.258
      commit f684c16971ed5e77dfa25a9ad25b5297e1f58eab
      category: bugfix
      bugzilla: https://gitee.com/src-openeuler/kernel/issues/I7ASU6
      CVE: CVE-2023-2860
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=f684c16971ed5e77dfa25a9ad25b5297e1f58eab
      
      --------------------------------
      
      [ Upstream commit 84a53580 ]
      
      The SRv6 layer allows defining HMAC data that can later be used to sign IPv6
      Segment Routing Headers. This configuration is realised via netlink through
      four attributes: SEG6_ATTR_HMACKEYID, SEG6_ATTR_SECRET, SEG6_ATTR_SECRETLEN and
      SEG6_ATTR_ALGID. Because the SECRETLEN attribute is decoupled from the actual
      length of the SECRET attribute, it is possible to provide invalid combinations
      (e.g., secret = "", secretlen = 64). This case is not checked in the code and
      with an appropriately crafted netlink message, an out-of-bounds read of up
      to 64 bytes (max secret length) can occur past the skb end pointer and into
      skb_shared_info:
      
      Breakpoint 1, seg6_genl_sethmac (skb=<optimized out>, info=<optimized out>) at net/ipv6/seg6.c:208
      208		memcpy(hinfo->secret, secret, slen);
      (gdb) bt
       #0  seg6_genl_sethmac (skb=<optimized out>, info=<optimized out>) at net/ipv6/seg6.c:208
       #1  0xffffffff81e012e9 in genl_family_rcv_msg_doit (skb=skb@entry=0xffff88800b1f9f00, nlh=nlh@entry=0xffff88800b1b7600,
          extack=extack@entry=0xffffc90000ba7af0, ops=ops@entry=0xffffc90000ba7a80, hdrlen=4, net=0xffffffff84237580 <init_net>, family=<optimized out>,
          family=<optimized out>) at net/netlink/genetlink.c:731
       #2  0xffffffff81e01435 in genl_family_rcv_msg (extack=0xffffc90000ba7af0, nlh=0xffff88800b1b7600, skb=0xffff88800b1f9f00,
          family=0xffffffff82fef6c0 <seg6_genl_family>) at net/netlink/genetlink.c:775
       #3  genl_rcv_msg (skb=0xffff88800b1f9f00, nlh=0xffff88800b1b7600, extack=0xffffc90000ba7af0) at net/netlink/genetlink.c:792
       #4  0xffffffff81dfffc3 in netlink_rcv_skb (skb=skb@entry=0xffff88800b1f9f00, cb=cb@entry=0xffffffff81e01350 <genl_rcv_msg>)
          at net/netlink/af_netlink.c:2501
       #5  0xffffffff81e00919 in genl_rcv (skb=0xffff88800b1f9f00) at net/netlink/genetlink.c:803
       #6  0xffffffff81dff6ae in netlink_unicast_kernel (ssk=0xffff888010eec800, skb=0xffff88800b1f9f00, sk=0xffff888004aed000)
          at net/netlink/af_netlink.c:1319
       #7  netlink_unicast (ssk=ssk@entry=0xffff888010eec800, skb=skb@entry=0xffff88800b1f9f00, portid=portid@entry=0, nonblock=<optimized out>)
          at net/netlink/af_netlink.c:1345
       #8  0xffffffff81dff9a4 in netlink_sendmsg (sock=<optimized out>, msg=0xffffc90000ba7e48, len=<optimized out>) at net/netlink/af_netlink.c:1921
      ...
      (gdb) p/x ((struct sk_buff *)0xffff88800b1f9f00)->head + ((struct sk_buff *)0xffff88800b1f9f00)->end
      $1 = 0xffff88800b1b76c0
      (gdb) p/x secret
      $2 = 0xffff88800b1b76c0
      (gdb) p slen
      $3 = 64 '@'
      
      The OOB data can then be read back from userspace by dumping HMAC state. This
      commit fixes this by ensuring SECRETLEN cannot exceed the actual length of
      SECRET.
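
      The essence of the fix can be sketched as the following userspace model.
      The function name and signature are illustrative (the real code lives in
      seg6_genl_sethmac()); what matters is that the declared secret length is
      bounded by the number of bytes the SECRET attribute actually carries.

      ```c
      /* Illustrative model of the bounds check: the declared secret length
       * must not exceed the bytes actually present in the SECRET attribute. */
      #include <assert.h>
      #include <errno.h>
      #include <stddef.h>
      #include <string.h>

      /* Returns 0 on success, -EINVAL on a bad slen, mirroring the commit's
       * "EINVAL when secretlen > len(secret)" behaviour. */
      static int set_hmac(char *dst, size_t dst_size,
                          const char *secret, size_t secret_attr_len, size_t slen)
      {
          if (slen > dst_size || slen > secret_attr_len)
              return -EINVAL;        /* would read past the attribute end */
          memcpy(dst, secret, slen); /* now provably in bounds */
          return 0;
      }

      int main(void)
      {
          char key[64];
          assert(set_hmac(key, sizeof(key), "topsecret", 9, 9) == 0);
          assert(set_hmac(key, sizeof(key), "", 0, 64) == -EINVAL); /* the CVE */
          return 0;
      }
      ```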
      Reported-by: Lucas Leong <wmliang.tw@gmail.com>
      Tested: verified that EINVAL is correctly returned when secretlen > len(secret)
      Fixes: 4f4853dc ("ipv6: sr: implement API to control SR HMAC structure")
      Signed-off-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
      Reviewed-by: Liu Jian <liujian56@huawei.com>
      Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
      d9d1ff02
    • L
      dm: add disk before alloc dax · ee776389
      Authored by Li Lingfeng
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I78SWJ
      CVE: NA
      
      -------------------------------
      
      In dm_create(), alloc_dev() may trigger a panic if alloc_dax() fails,
      since del_gendisk() will be called even though add_disk() was never
      called beforehand.

      Call add_disk() before alloc_dax() to avoid this.
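
      The underlying pattern is that an error path must only undo steps that
      actually ran. The following userspace model (hypothetical names, not the
      dm code) shows why reordering fixes the panic: once add_disk() runs
      first, the del_gendisk() in the error path always has something to undo.

      ```c
      /* Model of setup/teardown symmetry: del_gendisk() must never run
       * unless add_disk() ran first. */
      #include <assert.h>
      #include <stdbool.h>

      static bool disk_added;

      static void add_disk(void)    { disk_added = true; }
      static void del_gendisk(void) { assert(disk_added); disk_added = false; }
      static bool alloc_dax(bool ok){ return ok; } /* ok flag simulates failure */

      /* Fixed ordering: add_disk() before alloc_dax(), so the error path's
       * del_gendisk() is always safe. */
      static int alloc_dev(bool dax_ok)
      {
          add_disk();
          if (!alloc_dax(dax_ok)) {
              del_gendisk();       /* safe: add_disk() already ran */
              return -1;
          }
          return 0;
      }

      int main(void)
      {
          assert(alloc_dev(true) == 0);     /* success path */
          assert(alloc_dev(false) == -1);   /* error path no longer trips */
          return 0;
      }
      ```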
      Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
      Reviewed-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
      ee776389
    • L
      dm thin: Fix ABBA deadlock by resetting dm_bufio_client · c7ca9c06
      Authored by Li Lingfeng
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I79ZEK
      CVE: NA
      
      --------------------------------
      
      As described in commit 273494b2 ("dm thin: Fix ABBA deadlock between
      shrink_slab and dm_pool_abort_metadata"), an ABBA deadlock will be
      triggered because shrinker_rwsem needs to be held when operations on
      the dm pool metadata fail.
      
      We have noticed the following three problem scenarios:
      1) Described by commit 273494b2 ("dm thin: Fix ABBA deadlock between
      shrink_slab and dm_pool_abort_metadata")
      
      2) shrinker_rwsem and throttle->lock
                P1(drop cache)                        P2(kworker)
      drop_caches_sysctl_handler
       drop_slab
        shrink_slab
         down_read(&shrinker_rwsem)  - LOCK A
         do_shrink_slab
          super_cache_scan
           prune_icache_sb
            dispose_list
             evict
              ext4_evict_inode
               ext4_clear_inode
                ext4_discard_preallocations
                 ext4_mb_load_buddy_gfp
                  ext4_mb_init_cache
                   ext4_wait_block_bitmap
                    __ext4_error
                     ext4_handle_error
                      ext4_commit_super
                       ...
                       dm_submit_bio
                                           do_worker
                                            throttle_work_update
                                             down_write(&t->lock) -- LOCK B
                                            process_deferred_bios
                                             commit
                                              metadata_operation_failed
                                               dm_pool_abort_metadata
                                                dm_block_manager_create
                                                 dm_bufio_client_create
                                                  register_shrinker
                                                   down_write(&shrinker_rwsem)
                                                   -- LOCK A
                       thin_map
                        thin_bio_map
                         thin_defer_bio_with_throttle
                          throttle_lock
                           down_read(&t->lock)  - LOCK B
      
      3) shrinker_rwsem and wait_on_buffer
                P1(drop cache)                            P2(kworker)
      drop_caches_sysctl_handler
       drop_slab
        shrink_slab
         down_read(&shrinker_rwsem)  - LOCK A
         do_shrink_slab
         ...
          ext4_wait_block_bitmap
           __ext4_error
            ext4_handle_error
             jbd2_journal_abort
              jbd2_journal_update_sb_errno
               jbd2_write_superblock
                submit_bh
                 // LOCK B
                 // RELEASE B
                                   do_worker
                                    throttle_work_update
                                     down_write(&t->lock) - LOCK B
                                    process_deferred_bios
                                     process_bio
                                     commit
                                      metadata_operation_failed
                                       dm_pool_abort_metadata
                                        dm_block_manager_create
                                         dm_bufio_client_create
                                          register_shrinker
                                           register_shrinker_prepared
                                            down_write(&shrinker_rwsem)  - LOCK A
                                     bio_endio
            wait_on_buffer
             __wait_on_buffer
      
      Fix these by resetting dm_bufio_client without holding shrinker_rwsem.
      Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
      Reviewed-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
      c7ca9c06
  5. 06 Jun, 2023 (4 commits)