Merge branch '3.0' into test/chr/TD-14699

1a8a0ebb · haoranc · GitHub · 725fa461 · 1a08229c · 1a8a0ebb
205 changed files
--- a/.gitmodules
+++ b/.gitmodules
@@ -21,4 +21,4 @@
 	url = https://github.com/taosdata/taosadapter.git
 [submodule "tools/taosws-rs"]
 	path = tools/taosws-rs
-	url = https://github.com/taosdata/taosws-rs.git
+	url = https://github.com/taosdata/taosws-rs
--- a/docs/en/07-develop/08-cache.md
+++ b/docs/en/07-develop/08-cache.md
 ---
 sidebar_label: Cache
 title: Cache
-description: "The latest row of each table is kept in cache to provide high performance query of latest state."
+description: "Caching System inside TDengine"
 ---

+To achieve high-performance data writing and querying, TDengine uses a variety of caching techniques on both the server side and the client side.
+
+## Write Cache
+
 The cache management policy in TDengine is First-In-First-Out (FIFO). FIFO is also known as insert driven cache management policy and it is different from read driven cache management, which is more commonly known as Least-Recently-Used (LRU). FIFO simply stores the latest data in cache and flushes the oldest data in cache to disk, when the cache usage reaches a threshold. In IoT use cases, it is the current state i.e. the latest or most recent data that is important. The cache policy in TDengine, like much of the design and architecture of TDengine, is based on the nature of IoT data.

-Caching the latest data provides the capability of retrieving data in milliseconds. With this capability, TDengine can be configured properly to be used as a caching system without deploying another separate caching system. This simplifies the system architecture and minimizes operational costs. The cache is emptied after TDengine is restarted. TDengine does not reload data from disk into cache, like a key-value caching system.
+The memory space used by each vnode as write cache is determined when the database is created. The parameters `vgroups` and `buffer` specify the number of vnodes and the size of the write cache for each vnode, so the total write cache for the database is `vgroups * buffer`.
+
+```sql
+create database db0 vgroups 100 buffer 16MB
+```
+
+The above statement creates a database with 100 vnodes, each of which has a write cache of 16 MB.
+
+In theory a larger cache is always better, but once the cache grows beyond a certain threshold the additional benefit is marginal. The default value of the `buffer` parameter is therefore normally sufficient.
+
+## Read Cache

-The memory space used by the TDengine cache is fixed in size and configurable. It should be allocated based on application requirements and system resources. An independent memory pool is allocated for and managed by each vnode (virtual node) in TDengine. There is no sharing of memory pools between vnodes. All the tables belonging to a vnode share all the cache memory of the vnode.
+When creating a database, you can also specify whether to cache the latest data of each subtable by using the `cachelast` parameter. There are 4 cases:
+- 0: No cache for the latest data
+- 1: The last row of each table is cached; the `last_row` function benefits significantly from it
+- 2: The latest non-NULL value of each column of each table is cached; the `last` function benefits significantly when there is no `where`, `group by`, `order by` or `interval` clause
+- 3: Both the last row and the latest non-NULL value of each column of each table are cached, which is equivalent to enabling 1 and 2 together
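
The difference between what `cachelast` values 1 and 2 keep in memory can be illustrated with a small Python sketch. It only models the semantics described in the list above ("last row" versus "latest non-NULL value per column"); the function names mirror the SQL functions for readability, the sample rows and column names are made up, and nothing here reflects how TDengine implements the cache internally.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def last_row(rows: List[Row]) -> Optional[Row]:
    """Return the most recently written row, NULLs (None) included."""
    return rows[-1] if rows else None


def last(rows: List[Row]) -> Row:
    """Return, for every column, its most recent non-NULL value."""
    result: Row = {}
    for row in rows:  # rows are ordered oldest to newest
        for column, value in row.items():
            if value is not None:
                result[column] = value  # newer non-NULL values overwrite older ones
    return result


# Hypothetical sample data: the newest row has a NULL voltage.
rows = [
    {"ts": 1, "voltage": 220, "current": 10.2},
    {"ts": 2, "voltage": None, "current": 10.5},
]
print(last_row(rows))  # {'ts': 2, 'voltage': None, 'current': 10.5}
print(last(rows))      # {'ts': 2, 'voltage': 220, 'current': 10.5}
```

With value 3, both results would be available from cache at the same time.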

-The memory pool is divided into blocks and data is stored in row format in memory and each block follows FIFO policy. The size of each block is determined by configuration parameter `cache` and the number of blocks for each vnode is determined by the parameter `blocks`. For each vnode, the total cache size is `cache * blocks`.  A cache block needs to ensure that each table can store at least dozens of records, to be efficient.

-`last_row` function can be used to retrieve the last row of a table or a STable to quickly show the current state of devices on monitoring screen. For example the below SQL statement retrieves the latest voltage of all meters in San Francisco, California.
+## Meta Cache
+
+To process data writing and querying efficiently, each vnode caches the metadata it has already retrieved. The parameters `pages` and `pagesize` specify the size of the metadata cache for each vnode.

 ```sql
-select last_row(voltage) from meters where location='California.SanFrancisco';
+create database db0 pages 128 pagesize 16kb
 ```
+
+The above statement creates a database db0 in which each vnode is allocated a meta cache of `128 * 16 KB = 2 MB`.
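
The sizing arithmetic in the Write Cache and Meta Cache sections can be checked with a few lines of Python. This is only an illustrative sketch: the helper names are hypothetical and not part of TDengine, and it assumes `buffer` is given in MB and `pagesize` in KB, as in the two example statements.

```python
def total_write_cache_mb(vgroups: int, buffer_mb: int) -> int:
    """Total write cache of a database: one buffer per vnode (vgroups * buffer)."""
    return vgroups * buffer_mb


def meta_cache_kb_per_vnode(pages: int, pagesize_kb: int) -> int:
    """Metadata cache of a single vnode: pages * pagesize."""
    return pages * pagesize_kb


if __name__ == "__main__":
    # create database db0 vgroups 100 buffer 16MB
    print(total_write_cache_mb(100, 16), "MB of write cache in total")
    # create database db0 pages 128 pagesize 16kb
    print(meta_cache_kb_per_vnode(128, 16) / 1024, "MB of meta cache per vnode")
```

Running the numbers before creating a database makes it easier to see whether the chosen `vgroups` and `buffer` values fit into the memory available on the dnodes.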
+
+## File System Cache
+
+TDengine uses a write-ahead log (WAL) to provide basic reliability. Writing to the WAL essentially appends data to a disk file, so the file system cache also plays an important role in write performance. The `wal` parameter specifies the WAL write policy. There are 2 cases:
+- 1: Write data to the WAL without calling fsync. The data reaches the file system cache but is not flushed immediately, which gives better write performance
+- 2: Write data to the WAL and call fsync. The data is flushed to disk immediately, which gives higher reliability
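
The trade-off between the two `wal` levels comes down to whether `fsync` follows the append. The Python sketch below shows the pattern with the standard `os.write` and `os.fsync` calls. The constants, the function names and the `demo.wal` file are assumptions made for illustration and say nothing about TDengine's actual WAL format or code.

```python
import os
import tempfile

WAL_NO_FSYNC = 1  # corresponds to `wal 1` in the text
WAL_FSYNC = 2     # corresponds to `wal 2` in the text


def append_to_wal(fd: int, record: bytes, wal_level: int) -> None:
    """Append one record; call fsync only when the level asks for it."""
    os.write(fd, record)  # data lands in the file system cache
    if wal_level == WAL_FSYNC:
        os.fsync(fd)  # force the cached data onto the disk


def demo() -> bytes:
    """Write two records with different levels and read the file back."""
    path = os.path.join(tempfile.mkdtemp(), "demo.wal")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        append_to_wal(fd, b"row-1\n", WAL_NO_FSYNC)
        append_to_wal(fd, b"row-2\n", WAL_FSYNC)
    finally:
        os.close(fd)
    with open(path, "rb") as f:
        return f.read()


print(demo())
```

Both records are readable afterwards in either case; the difference only shows up if the machine loses power before the operating system flushes its cache, which is why level 2 is the more reliable and the slower option.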
+
+## Client Cache
+
+To improve overall data processing efficiency, the core library `libtaos.so` (also referred to as `taosc`), which all client programs depend on, has its own cache in addition to the caches above. `taosc` caches the metadata of the databases, super tables and tables that the invoking client has accessed, plus other critical metadata such as the cluster topology.
+
+When multiple client programs access a TDengine cluster and one of them modifies some metadata, the cached metadata in the other clients may become stale. When this happens, those clients need to execute "reset query cache" to invalidate the whole cache, which forces `taosc` to re-pull the metadata it needs and rebuild the cache.
--- a/docs/zh/07-develop/08-cache.md
+++ b/docs/zh/07-develop/08-cache.md
 ---
 sidebar_label: 缓存
 title: 缓存
-description: "提供写驱动的缓存管理机制，将每个表最近写入的一条记录持续保存在缓存中，可以提供高性能的最近状态查询。"
+description: "TDengine 内部的缓存设计"
 ---

+为了实现高效的写入和查询，TDengine 充分利用了各种缓存技术，本节将对 TDengine 中对缓存的使用做详细的说明。
+
+## 写缓存
+
 TDengine 采用时间驱动缓存管理策略（First-In-First-Out，FIFO），又称为写驱动的缓存管理机制。这种策略有别于读驱动的数据缓存模式（Least-Recent-Used，LRU），直接将最近写入的数据保存在系统的缓存中。当缓存达到临界值的时候，将最早的数据批量写入磁盘。一般意义上来说，对于物联网数据的使用，用户最为关心最近产生的数据，即当前状态。TDengine 充分利用了这一特性，将最近到达的（当前状态）数据保存在缓存中。

-TDengine 通过查询函数向用户提供毫秒级的数据获取能力。直接将最近到达的数据保存在缓存中，可以更加快速地响应用户针对最近一条或一批数据的查询分析，整体上提供更快的数据库查询响应能力。从这个意义上来说，可通过设置合适的配置参数将 TDengine 作为数据缓存来使用，而不需要再部署额外的缓存系统，可有效地简化系统架构，降低运维的成本。需要注意的是，TDengine 重启以后系统的缓存将被清空，之前缓存的数据均会被批量写入磁盘，缓存的数据将不会像专门的 key-value 缓存系统再将之前缓存的数据重新加载到缓存中。
+每个 vnode 的写入缓存大小在创建数据库时决定，创建数据库时的两个关键参数 vgroups 和 buffer 分别决定了该数据库中的数据由多少个 vgroup 处理，以及向其中的每个 vnode 分配多少写入缓存。
+
+```sql
+create database db0 vgroups 100 buffer 16MB
+```
+
+理论上缓存越大越好，但超过一定阈值后再增加缓存对写入性能提升并无帮助，一般情况下使用默认值即可。

-TDengine 分配固定大小的内存空间作为缓存空间，缓存空间可根据应用的需求和硬件资源配置。通过适当的设置缓存空间，TDengine 可以提供极高性能的写入和查询的支持。TDengine 中每个虚拟节点（virtual node）创建时分配独立的缓存池。每个虚拟节点管理自己的缓存池，不同虚拟节点间不共享缓存池。每个虚拟节点内部所属的全部表共享该虚拟节点的缓存池。
+## 读缓存

-TDengine 将内存池按块划分进行管理，数据在内存块里是以行（row）的形式存储。一个 vnode 的内存池是在 vnode 创建时按块分配好，而且每个内存块按照先进先出的原则进行管理。在创建内存池时，块的大小由系统配置参数 cache 决定；每个 vnode 中内存块的数目则由配置参数 blocks 决定。因此对于一个 vnode，总的内存大小为：`cache * blocks`。一个 cache block 需要保证每张表能存储至少几十条以上记录，才会有效率。
+在创建数据库时可以选择是否缓存该数据库中每个子表的最新数据。由参数 cachelast 设置，分为四种情况：
+- 0: 不缓存
+- 1: 缓存子表最近一行数据，这将显著改善 last_row 函数的性能
+- 2: 缓存子表每一列最近的非 NULL 值，这将显著改善无特殊影响（比如 WHERE, ORDER BY, GROUP BY, INTERVAL）时的 last 函数的性能
+- 3: 同时缓存行和列，即等同于上述 cachelast 值为 1 或 2 时的行为同时生效

-你可以通过函数 last_row() 快速获取一张表或一张超级表的最后一条记录，这样很便于在大屏显示各设备的实时状态或采集值。例如：
+## 元数据缓存
+
+为了更高效地处理查询和写入，每个 vnode 都会缓存自己曾经获取到的元数据。元数据缓存由创建数据库时的两个参数 pages 和 pagesize 决定。

 ```sql
-select last_row(voltage) from meters where location='California.SanFrancisco';
+create database db0 pages 128 pagesize 16kb
 ```

-该 SQL 语句将获取所有位于加利福尼亚州旧金山市的电表最后记录的电压值。
+上述语句会为数据库 db0 的每个 vnode 创建 128 个 page，每个 page 16kb 的元数据缓存。
+
+## 文件系统缓存
+
+TDengine 利用 WAL 技术来提供基本的数据可靠性。写入 WAL 本质上是以顺序追加的方式写入磁盘文件。此时文件系统缓存在写入性能中也会扮演关键角色。在创建数据库时可以利用 wal 参数来选择性能优先或者可靠性优先。
+- 1: 写 WAL 但不执行 fsync ，新写入 WAL 的数据保存在文件系统缓存中但并未写入磁盘，这种方式性能优先
+- 2: 写 WAL 且执行 fsync，新写入 WAL 的数据被立即同步到磁盘上，可靠性更高
+
+## 客户端缓存
+
+为了进一步提升整个系统的处理效率，除了以上提到的服务端缓存技术之外，在 TDengine 的所有客户端都要调用的核心库 libtaos.so （也称为 taosc ）中也充分利用了缓存技术。在 taosc 中会缓存所访问过的各个数据库、超级表以及子表的元数据，集群的拓扑结构等关键元数据。
+
+当有多个客户端同时访问 TDengine 集群，且其中一个客户端对某些元数据进行了修改的情况下，有可能会出现其它客户端所缓存的元数据不同步或失效的情况，此时需要在客户端执行 "reset query cache" 以让整个缓存失效从而强制重新拉取最新的元数据重新建立缓存。
--- a/docs/zh/10-cluster/01-deploy.md
+++ b/docs/zh/10-cluster/01-deploy.md
@@ -10,7 +10,7 @@ title: 集群部署

 ### 第一步

-如果搭建集群的物理节点中，存有之前的测试数据、装过 1.X 的版本，或者装过其他版本的 TDengine，请先将其删除，并清空所有数据（如果需要保留原有数据，请联系涛思交付团队进行旧版本升级、数据迁移），具体步骤请参考博客[《TDengine 多种安装包的安装和卸载》](https://www.taosdata.com/blog/2019/08/09/566.html)。
+如果搭建集群的物理节点中，存有之前的测试数据，或者装过其他版本的 TDengine，请先将其删除，并清空所有数据（如果需要保留原有数据，请联系涛思交付团队进行旧版本升级、数据迁移），具体步骤请参考博客[《TDengine 多种安装包的安装和卸载》](https://www.taosdata.com/blog/2019/08/09/566.html)。

 :::note
 因为 FQDN 的信息会写进文件，如果之前没有配置或者更改 FQDN，且启动了 TDengine。请一定在确保数据无用或者备份的前提下，清理一下之前的数据（rm -rf /var/lib/taos/\*）；
@@ -54,30 +54,16 @@ fqdn                  h1.taosdata.com
 // 配置本数据节点的端口号，缺省是 6030
 serverPort            6030

-// 副本数为偶数的时候，需要配置，请参考《Arbitrator 的使用》的部分
-arbitrator            ha.taosdata.com:6042
-```
-
 一定要修改的参数是 firstEp 和 fqdn。在每个数据节点，firstEp 需全部配置成一样，但 fqdn 一定要配置成其所在数据节点的值。其他参数可不做任何修改，除非你很清楚为什么要修改。

-加入到集群中的数据节点 dnode，涉及集群相关的下表 9 项参数必须完全相同，否则不能成功加入到集群中。
+加入到集群中的数据节点 dnode，下表中涉及集群相关的参数必须完全相同，否则不能成功加入到集群中。

 | **#** | **配置参数名称**   | **含义**                                    |
 | ----- | ------------------ | ------------------------------------------- |
-| 1     | numOfMnodes        | 系统中管理节点个数                          |
-| 2     | mnodeEqualVnodeNum | 一个 mnode 等同于 vnode 消耗的个数          |
-| 3     | offlineThreshold   | dnode 离线阈值，超过该时间将导致 Dnode 离线 |
-| 4     | statusInterval     | dnode 向 mnode 报告状态时长                 |
-| 5     | arbitrator         | 系统中裁决器的 End Point                    |
-| 6     | timezone           | 时区                                        |
-| 7     | balance            | 是否启动负载均衡                            |
-| 8     | maxTablesPerVnode  | 每个 vnode 中能够创建的最大表个数           |
-| 9     | maxVgroupsPerDb    | 每个 DB 中能够使用的最大 vgroup 个数        |
-
-:::note
-在 2.0.19.0 及更早的版本中，除以上 9 项参数外，dnode 加入集群时，还会要求 locale 和 charset 参数的取值也一致。
-
-:::
+| 1     | statusInterval     | dnode 向 mnode 报告状态时长                 |
+| 2     | timezone           | 时区                                        |
+| 3     | locale             | 系统区位信息及编码格式                       |
+| 4     | charset            | 字符集编码                                 |
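
The rule stated above the table (the listed cluster parameters must be identical on every dnode, otherwise the node cannot join) can be verified before starting a node. The following Python sketch compares two already-parsed configurations. The helper name and the sample values are hypothetical, and parsing `taos.cfg` itself is left out.

```python
from typing import Dict, List

# The four parameters listed in the table above.
CLUSTER_PARAMS = ("statusInterval", "timezone", "locale", "charset")


def mismatched_params(first: Dict[str, str], other: Dict[str, str]) -> List[str]:
    """Return the cluster parameters whose values differ between two dnodes."""
    return [p for p in CLUSTER_PARAMS if first.get(p) != other.get(p)]


# Hypothetical settings of two dnodes; only the timezone differs.
dnode1 = {"statusInterval": "1", "timezone": "UTC-8", "locale": "en_US.UTF-8", "charset": "UTF-8"}
dnode2 = {"statusInterval": "1", "timezone": "UTC", "locale": "en_US.UTF-8", "charset": "UTF-8"}
print(mismatched_params(dnode1, dnode2))  # ['timezone']
```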

 ## 启动集群


--- a/docs/zh/10-cluster/02-cluster-mgmt.md
+++ b/docs/zh/10-cluster/02-cluster-mgmt.md
@@ -24,15 +24,15 @@ SHOW DNODES;

 ```
 taos> show dnodes;
-   id   |           end_point            | vnodes | cores  |   status   | role  |       create_time       |      offline reason      |
-======================================================================================================================================
-      1 | localhost:6030                 |      9 |      8 | ready      | any   | 2022-04-15 08:27:09.359 |                          |
-Query OK, 1 row(s) in set (0.008298s)
+   id   |            endpoint            | vnodes | support_vnodes |   status   |       create_time       |              note              |
+============================================================================================================================================
+      1 | trd01:6030                     |    100 |           1024 | ready      | 2022-07-15 16:47:47.726 |                                |
+Query OK, 1 rows affected (0.006684s)
 ```

 ## 查看虚拟节点组

-为充分利用多核技术，并提供 scalability，数据需要分片处理。因此 TDengine 会将一个 DB 的数据切分成多份，存放在多个 vnode 里。这些 vnode 可能分布在多个数据节点 dnode 里，这样就实现了水平扩展。一个 vnode 仅仅属于一个 DB，但一个 DB 可以有多个 vnode。vnode 所在的数据节点是 mnode 根据当前系统资源的情况，自动进行分配的，无需任何人工干预。
+为充分利用多核技术，并提供横向扩展能力，数据需要分片处理。因此 TDengine 会将一个 DB 的数据切分成多份，存放在多个 vnode 里。这些 vnode 可能分布在多个数据节点 dnode 里，这样就实现了水平扩展。一个 vnode 仅仅属于一个 DB，但一个 DB 可以有多个 vnode。vnode 所在的数据节点是 mnode 根据当前系统资源的情况，自动进行分配的，无需任何人工干预。

 启动 CLI 程序 taos，然后执行：

@@ -44,26 +44,15 @@ SHOW VGROUPS;
 输出如下（具体内容仅供参考，取决于实际的集群配置）

 ```
-taos> show dnodes;
-   id   |           end_point            | vnodes | cores  |   status   | role  |       create_time       |      offline reason      |
-======================================================================================================================================
-      1 | localhost:6030                 |      9 |      8 | ready      | any   | 2022-04-15 08:27:09.359 |                          |
-Query OK, 1 row(s) in set (0.008298s)
-
 taos> use db;
 Database changed.

 taos> show vgroups;
-    vgId     |   tables    |  status  |   onlines   | v1_dnode | v1_status | compacting  |
-==========================================================================================
-          14 |       38000 | ready    |           1 |        1 | master    |           0 |
-          15 |       38000 | ready    |           1 |        1 | master    |           0 |
-          16 |       38000 | ready    |           1 |        1 | master    |           0 |
-          17 |       38000 | ready    |           1 |        1 | master    |           0 |
-          18 |       37001 | ready    |           1 |        1 | master    |           0 |
-          19 |       37000 | ready    |           1 |        1 | master    |           0 |
-          20 |       37000 | ready    |           1 |        1 | master    |           0 |
-          21 |       37000 | ready    |           1 |        1 | master    |           0 |
+  vgroup_id  |            db_name             |   tables    |  v1_dnode   | v1_status  |  v2_dnode   | v2_status  |  v3_dnode   | v3_status  |    status    |   nfiles    |  file_size  | tsma |
+================================================================================================================================================================================================
+           2 | db                             |           0 |           1 | leader     |        NULL | NULL       |        NULL | NULL       | NULL         |        NULL |        NULL |    0 |
+           3 | db                             |           0 |           1 | leader     |        NULL | NULL       |        NULL | NULL       | NULL         |        NULL |        NULL |    0 |
+           4 | db                             |           0 |           1 | leader     |        NULL | NULL       |        NULL | NULL       | NULL         |        NULL |        NULL |    0 |
 Query OK, 8 row(s) in set (0.001154s)
 ```

@@ -77,35 +66,21 @@ CREATE DNODE "fqdn:port";

 将新数据节点的 End Point 添加进集群的 EP 列表。“fqdn:port“需要用双引号引起来，否则出错。一个数据节点对外服务的 fqdn 和 port 可以通过配置文件 taos.cfg 进行配置，缺省是自动获取。【强烈不建议用自动获取方式来配置 FQDN，可能导致生成的数据节点的 End Point 不是所期望的】

-示例如下：
-```
-taos> create dnode "localhost:7030";
-Query OK, 0 of 0 row(s) in database (0.008203s)
-
-taos> show dnodes;
-   id   |           end_point            | vnodes | cores  |   status   | role  |       create_time       |      offline reason      |
-======================================================================================================================================
-      1 | localhost:6030                 |      9 |      8 | ready      | any   | 2022-04-15 08:27:09.359 |                          |
-      2 | localhost:7030                 |      0 |      0 | offline    | any   | 2022-04-19 08:11:42.158 | status not received      |
-Query OK, 2 row(s) in set (0.001017s)
-```
-
-在上面的示例中可以看到新创建的 dnode 的状态为 offline，待该 dnode 被启动并连接上配置文件中指定的 firstEp后再次查看，得到如下结果（示例）
+然后启动新加入的数据节点的 taosd 进程，再通过 taos 查看数据节点状态：

 ```
 taos> show dnodes;
-   id   |           end_point            | vnodes | cores  |   status   | role  |       create_time       |      offline reason      |
-======================================================================================================================================
-      1 | localhost:6030                 |      3 |      8 | ready      | any   | 2022-04-15 08:27:09.359 |                          |
-      2 | localhost:7030                 |      6 |      8 | ready      | any   | 2022-04-19 08:14:59.165 |                          |
-Query OK, 2 row(s) in set (0.001316s)
+   id   |            endpoint            | vnodes | support_vnodes |   status   |       create_time       |              note              |
+============================================================================================================================================
+      1 | trd01:6030                     |    100 |           1024 | ready      | 2022-07-15 16:47:47.726 |                                |
+      2 | trd04:6030                     |      0 |           1024 | ready      | 2022-07-15 16:56:13.670 |                                |
+Query OK, 2 rows affected (0.007031s)
 ```
 从中可以看到两个 dnode 状态都为 ready

-
 ## 删除数据节点

-启动 CLI 程序 taos，然后执行：
+先停止要删除的数据节点的 taosd 进程，然后启动 CLI 程序 taos，执行：

 ```sql
 DROP DNODE "fqdn:port";
@@ -117,26 +92,6 @@ DROP DNODE dnodeId;

 通过 “fqdn:port” 或 dnodeID 来指定一个具体的节点都是可以的。其中 fqdn 是被删除的节点的 FQDN，port 是其对外服务器的端口号；dnodeID 可以通过 SHOW DNODES 获得。

-示例如下：
-```
-taos> show dnodes;
-   id   |           end_point            | vnodes | cores  |   status   | role  |       create_time       |      offline reason      |
-======================================================================================================================================
-      1 | localhost:6030                 |      9 |      8 | ready      | any   | 2022-04-15 08:27:09.359 |                          |
-      2 | localhost:7030                 |      0 |      0 | offline    | any   | 2022-04-19 08:11:42.158 | status not received      |
-Query OK, 2 row(s) in set (0.001017s)
-
-taos> drop dnode 2;
-Query OK, 0 of 0 row(s) in database (0.000518s)
-
-taos> show dnodes;
-   id   |           end_point            | vnodes | cores  |   status   | role  |       create_time       |      offline reason      |
-======================================================================================================================================
-      1 | localhost:6030                 |      9 |      8 | ready      | any   | 2022-04-15 08:27:09.359 |                          |
-Query OK, 1 row(s) in set (0.001137s)
-```
-
-上面的示例中，初次执行 `show dnodes` 列出了两个 dnode, 执行 `drop dnode 2` 删除其中 ID 为 2 的 dnode 之后再次执行 `show dnodes`，可以看到只剩下 ID 为 1 的 dnode 。

 :::warning

@@ -147,70 +102,4 @@ dnodeID 是集群自动分配的，不得人工指定。它在生成时是递增

 :::

-## 手动迁移数据节点
-
-手动将某个 vnode 迁移到指定的 dnode。
-
-启动 CLI 程序 taos，然后执行：
-
-```sql
-ALTER DNODE <source-dnodeId> BALANCE "VNODE:<vgId>-DNODE:<dest-dnodeId>";
-```
-
-其中：source-dnodeId 是源 dnodeId，也就是待迁移的 vnode 所在的 dnodeID；vgId 可以通过 SHOW VGROUPS 获得，列表的第一列；dest-dnodeId 是目标 dnodeId。
-
-首先执行 `show vgroups` 查看 vgroup 的分布情况 
-```
-taos> show vgroups;
-    vgId     |   tables    |  status  |   onlines   | v1_dnode | v1_status | compacting  |
-==========================================================================================
-          14 |       38000 | ready    |           1 |        3 | master    |           0 |
-          15 |       38000 | ready    |           1 |        3 | master    |           0 |
-          16 |       38000 | ready    |           1 |        3 | master    |           0 |
-          17 |       38000 | ready    |           1 |        3 | master    |           0 |
-          18 |       37001 | ready    |           1 |        3 | master    |           0 |
-          19 |       37000 | ready    |           1 |        1 | master    |           0 |
-          20 |       37000 | ready    |           1 |        1 | master    |           0 |
-          21 |       37000 | ready    |           1 |        1 | master    |           0 |
-Query OK, 8 row(s) in set (0.001314s)
-```
-
-从中可以看到在 dnode 3 中有5个 vgroup，而 dnode 1 有 3 个 vgroup，假定我们想将其中 vgId 为18 的 vgroup 从 dnode 3 迁移到 dnode 1
-
-```
-taos> alter dnode 3 balance "vnode:18-dnode:1";
-
-DB error: Balance already enabled (0.00755
-```
-
-上面的结果表明目前所在数据库已经启动了 balance 选项，所以无法进行手动迁移。
-
-停止整个集群，将两个 dnode 的配置文件中的 balance 都设置为 0 （默认为1）之后，重新启动集群，再次执行 ` alter dnode` 和 `show vgroups` 命令如下
-```
-taos> alter dnode 3 balance "vnode:18-dnode:1";
-Query OK, 0 row(s) in set (0.000575s)
-
-taos> show vgroups;
-    vgId     |   tables    |  status  |   onlines   | v1_dnode | v1_status | v2_dnode | v2_status | compacting  |
-=================================================================================================================
-          14 |       38000 | ready    |           1 |        3 | master    |        0 | NULL      |           0 |
-          15 |       38000 | ready    |           1 |        3 | master    |        0 | NULL      |           0 |
-          16 |       38000 | ready    |           1 |        3 | master    |        0 | NULL      |           0 |
-          17 |       38000 | ready    |           1 |        3 | master    |        0 | NULL      |           0 |
-          18 |       37001 | ready    |           2 |        1 | slave     |        3 | master    |           0 |
-          19 |       37000 | ready    |           1 |        1 | master    |        0 | NULL      |           0 |
-          20 |       37000 | ready    |           1 |        1 | master    |        0 | NULL      |           0 |
-          21 |       37000 | ready    |           1 |        1 | master    |        0 | NULL      |           0 |
-Query OK, 8 row(s) in set (0.001242s)
-```
-
-从上面的输出可以看到 vgId 为 18 的 vnode 被从 dnode 3 迁移到了 dnode 1。
-
-:::warning
-
-只有在集群的自动负载均衡选项关闭时（balance 设置为 0），才允许手动迁移。
-只有处于正常工作状态的 vnode 才能被迁移：master/slave；当处于 offline/unsynced/syncing 状态时，是不能迁移的。
-迁移前，务必核实目标 dnode 的资源足够：CPU、内存、硬盘。
-
-:::

--- a/docs/zh/10-cluster/03-ha-and-lb.md
+++ b/docs/zh/10-cluster/03-ha-and-lb.md
---
-title: 高可用与负载均衡
---
-
-## Vnode 的高可用性
-
-TDengine 通过多副本的机制来提供系统的高可用性，包括 vnode 和 mnode 的高可用性。
-
-vnode 的副本数是与 DB 关联的，一个集群里可以有多个 DB，根据运营的需求，每个 DB 可以配置不同的副本数。创建数据库时，通过参数 replica 指定副本数（缺省为 1）。如果副本数为 1，系统的可靠性无法保证，只要数据所在的节点宕机，就将无法提供服务。集群的节点数必须大于等于副本数，否则创建表时将返回错误“more dnodes are needed”。比如下面的命令将创建副本数为 3 的数据库 demo：
-
-```sql
-CREATE DATABASE demo replica 3;
-```
-
-一个 DB 里的数据会被切片分到多个 vnode group，vnode group 里的 vnode 数目就是 DB 的副本数，同一个 vnode group 里各 vnode 的数据是完全一致的。为保证高可用性，vnode group 里的 vnode 一定要分布在不同的数据节点 dnode 里（实际部署时，需要在不同的物理机上），只要一个 vnode group 里超过半数的 vnode 处于工作状态，这个 vnode group 就能正常的对外服务。
-
-一个数据节点 dnode 里可能有多个 DB 的数据，因此一个 dnode 离线时，可能会影响到多个 DB。如果一个 vnode group 里的一半或一半以上的 vnode 不工作，那么该 vnode group 就无法对外服务，无法插入或读取数据，这样会影响到它所属的 DB 的一部分表的读写操作。
-
-因为 vnode 的引入，无法简单地给出结论：“集群中过半数据节点 dnode 工作，集群就应该工作”。但是对于简单的情形，很好下结论。比如副本数为 3，只有三个 dnode，那如果仅有一个节点不工作，整个集群还是可以正常工作的，但如果有两个数据节点不工作，那整个集群就无法正常工作了。
-
-## Mnode 的高可用性
-
-TDengine 集群是由 mnode（taosd 的一个模块，管理节点）负责管理的，为保证 mnode 的高可用，可以配置多个 mnode 副本，副本数由系统配置参数 numOfMnodes 决定，有效范围为 1-3。为保证元数据的强一致性，mnode 副本之间是通过同步的方式进行数据复制的。
-
-一个集群有多个数据节点 dnode，但一个 dnode 至多运行一个 mnode 实例。多个 dnode 情况下，哪个 dnode 可以作为 mnode 呢？这是完全由系统根据整个系统资源情况，自动指定的。用户可通过 CLI 程序 taos，在 TDengine 的 console 里，执行如下命令：
-
-```sql
-SHOW MNODES;
-```
-
-来查看 mnode 列表，该列表将列出 mnode 所处的 dnode 的 End Point 和角色（master，slave，unsynced 或 offline）。当集群中第一个数据节点启动时，该数据节点一定会运行一个 mnode 实例，否则该数据节点 dnode 无法正常工作，因为一个系统是必须有至少一个 mnode 的。如果 numOfMnodes 配置为 2，启动第二个 dnode 时，该 dnode 也将运行一个 mnode 实例。
-
-为保证 mnode 服务的高可用性，numOfMnodes 必须设置为 2 或更大。因为 mnode 保存的元数据必须是强一致的，如果 numOfMnodes 大于 2，复制参数 quorum 自动设为 2，也就是说，至少要保证有两个副本写入数据成功，才通知客户端应用写入成功。
-
-:::note
-一个 TDengine 高可用系统，无论是 vnode 还是 mnode，都必须配置多个副本。
-
-:::
-
-## 负载均衡
-
-有三种情况，将触发负载均衡，而且都无需人工干预。
-
-当一个新数据节点添加进集群时，系统将自动触发负载均衡，一些节点上的数据将被自动转移到新数据节点上，无需任何人工干预。
-当一个数据节点从集群中移除时，系统将自动把该数据节点上的数据转移到其他数据节点，无需任何人工干预。
-如果一个数据节点过热（数据量过大），系统将自动进行负载均衡，将该数据节点的一些 vnode 自动挪到其他节点。
-当上述三种情况发生时，系统将启动各个数据节点的负载计算，从而决定如何挪动。
-
-:::tip
-负载均衡由参数 balance 控制，它决定是否启动自动负载均衡，0 表示禁用，1 表示启用自动负载均衡。
-
-:::
-
-## 数据节点离线处理
-
-如果一个数据节点离线，TDengine 集群将自动检测到。有如下两种情况：
-
-该数据节点离线超过一定时间（taos.cfg 里配置参数 offlineThreshold 控制时长），系统将自动把该数据节点删除，产生系统报警信息，触发负载均衡流程。如果该被删除的数据节点重新上线时，它将无法加入集群，需要系统管理员重新将其添加进集群才会开始工作。
-
-离线后，在 offlineThreshold 的时长内重新上线，系统将自动启动数据恢复流程，等数据完全恢复后，该节点将开始正常工作。
-
-:::note
-如果一个虚拟节点组（包括 mnode 组）里所归属的每个数据节点都处于离线或 unsynced 状态，必须等该虚拟节点组里的所有数据节点都上线、都能交换状态信息后，才能选出 Master，该虚拟节点组才能对外提供服务。比如整个集群有 3 个数据节点，副本数为 3，如果 3 个数据节点都宕机，然后 2 个数据节点重启，是无法工作的，只有等 3 个数据节点都重启成功，才能对外服务。
-
-:::
-
-## Arbitrator 的使用
-
-如果副本数为偶数，当一个 vnode group 里一半或超过一半的 vnode 不工作时，是无法从中选出 master 的。同理，一半或超过一半的 mnode 不工作时，是无法选出 mnode 的 master 的，因为存在“split brain”问题。
-
-为解决这个问题，TDengine 引入了 Arbitrator 的概念。Arbitrator 模拟一个 vnode 或 mnode 在工作，但只简单的负责网络连接，不处理任何数据插入或访问。只要包含 Arbitrator 在内，超过半数的 vnode 或 mnode 工作，那么该 vnode group 或 mnode 组就可以正常的提供数据插入或查询服务。比如对于副本数为 2 的情形，如果一个节点 A 离线，但另外一个节点 B 正常，而且能连接到 Arbitrator，那么节点 B 就能正常工作。
-
-总之，在目前版本下，TDengine 建议在双副本环境要配置 Arbitrator，以提升系统的可用性。
-
-Arbitrator 的执行程序名为 tarbitrator。该程序对系统资源几乎没有要求，只需要保证有网络连接，找任何一台 Linux 服务器运行它即可。以下简要描述安装配置的步骤：
-
-请点击 安装包下载，在 TDengine Arbitrator Linux 一节中，选择合适的版本下载并安装。
-该应用的命令行参数 -p 可以指定其对外服务的端口号，缺省是 6042。
-
-修改每个 taosd 实例的配置文件，在 taos.cfg 里将参数 arbitrator 设置为 tarbitrator 程序所对应的 End Point。（如果该参数配置了，当副本数为偶数时，系统将自动连接配置的 Arbitrator。如果副本数为奇数，即使配置了 Arbitrator，系统也不会去建立连接。）
-
-在配置文件中配置了的 Arbitrator，会出现在 SHOW DNODES 指令的返回结果中，对应的 role 列的值会是“arb”。
-查看集群 Arbitrator 的状态【2.0.14.0 以后支持】
-
-```sql
-SHOW DNODES;
-```
--- a/docs/zh/10-cluster/03-high-availability.md
+++ b/docs/zh/10-cluster/03-high-availability.md
+---
+title: 高可用
+---
+
+## Vnode 的高可用性
+
+TDengine 通过多副本的机制来提供系统的高可用性，包括 vnode 和 mnode 的高可用性。
+
+vnode 的副本数是与 DB 关联的，一个集群里可以有多个 DB，根据运营的需求，每个 DB 可以配置不同的副本数。创建数据库时，通过参数 replica 指定副本数（缺省为 1）。如果副本数为 1，系统的可靠性无法保证，只要数据所在的节点宕机，就将无法提供服务。集群的节点数必须大于等于副本数，否则创建表时将返回错误“more dnodes are needed”。比如下面的命令将创建副本数为 3 的数据库 demo：
+
+```sql
+CREATE DATABASE demo replica 3;
+```
+
+一个 DB 里的数据会被切片分到多个 vnode group，vnode group 里的 vnode 数目就是 DB 的副本数，同一个 vnode group 里各 vnode 的数据是完全一致的。为保证高可用性，vnode group 里的 vnode 一定要分布在不同的数据节点 dnode 里（实际部署时，需要在不同的物理机上），只要一个 vnode group 里超过半数的 vnode 处于工作状态，这个 vnode group 就能正常的对外服务。
+
+一个数据节点 dnode 里可能有多个 DB 的数据，因此一个 dnode 离线时，可能会影响到多个 DB。如果一个 vnode group 里的一半或一半以上的 vnode 不工作，那么该 vnode group 就无法对外服务，无法插入或读取数据，这样会影响到它所属的 DB 的一部分表的读写操作。
+
+因为 vnode 的引入，无法简单地给出结论：“集群中过半数据节点 dnode 工作，集群就应该工作”。但是对于简单的情形，很好下结论。比如副本数为 3，只有三个 dnode，那如果仅有一个节点不工作，整个集群还是可以正常工作的，但如果有两个数据节点不工作，那整个集群就无法正常工作了。
+
+## Mnode 的高可用性
+
+TDengine 集群是由 mnode（taosd 的一个模块，管理节点）负责管理的，为保证 mnode 的高可用，可以配置多个 mnode 副本，在集群启动时只有一个 mnode，用户可以通过 `create mnode` 来增加新的 mnode。用户可以通过该命令自主决定哪几个 dnode 会承担 mnode 的角色。为保证元数据的强一致性，在有多个 mnode 时，mnode 副本之间是通过同步的方式进行数据复制的。
+
+一个集群有多个数据节点 dnode，但一个 dnode 至多运行一个 mnode 实例。用户可通过 CLI 程序 taos，在 TDengine 的 console 里，执行如下命令：
+
+```sql
+SHOW MNODES;
+```
+
+来查看 mnode 列表，该列表将列出 mnode 所处的 dnode 的 End Point 和角色（leader, follower, candidate）。当集群中第一个数据节点启动时，该数据节点一定会运行一个 mnode 实例，否则该数据节点 dnode 无法正常工作，因为一个系统是必须有至少一个 mnode 的。
+
+在 TDengine 3.0 及以后的版本中，数据同步采用 RAFT 协议，所以 mnode 的数量应该被设置为 1 个或者 3 个。
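
The recommendation of 1 or 3 mnodes follows from majority-based replication in general. The Python sketch below applies the usual Raft majority rule, in which a group keeps working while more than half of its members are reachable. It is an assumption that TDengine applies exactly this rule; the text above only states the recommended counts.

```python
def quorum(replicas: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many nodes may fail while a majority is still reachable."""
    return replicas - quorum(replicas)


for n in (1, 2, 3):
    print(n, quorum(n), tolerated_failures(n))
```

Under this rule two replicas tolerate no more failures than one, while three replicas tolerate the loss of a single node, which is consistent with choosing 1 or 3.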
--- a/docs/zh/10-cluster/04-load-balance.md
+++ b/docs/zh/10-cluster/04-load-balance.md
+---
+title: 负载均衡
+---
+
+TDengine 中的负载均衡主要指对时序数据的处理的负载均衡。TDengine 采用 Hash 一致性算法将一个数据库中的所有表和子表的数据均衡分散在属于该数据库的所有 vgroups 中，每张表或子表只能由一个 vgroups 处理，一个 vgroups 可能负责处理多个表或子表。
+
+创建数据库时可以指定其中的 vgroups 的数量：
+
+```sql
+create database db0 vgroups 100;
+```
+
+如何指定合适的 vgroups 的数量，这取决于系统资源。假定系统中只计划建立一个数据库，则 vgroups 由集群中所有 dnode 所能使用的资源决定。原则上可用的 CPU 和 Memory 越多，可建立的 vgroups 也越多。但也要考虑到磁盘性能，过多的 vgroups 在磁盘性能达到上限后反而会拖累整个系统的性能。假如系统中会建立多个数据库，则多个数据库的 vgroups 之和取决于系统中可用资源的数量。要综合考虑多个数据库之间表的数量、写入频率、数据量等多个因素在多个数据库之间分配 vgroups。实际中建议首先根据系统资源配置选择一个初始的 vgroups 数量，比如 CPU 总核数的 2 倍，以此为起点通过测试找到最佳的 vgroups 数量配置，此为系统中的 vgroups 总数。如果有多个数据库的话，再根据各个数据库的表数和数据量对 vgroups 进行分配。
+
+此外，对于任意数据库的 vgroups，TDengine 都是尽可能将其均衡分散在多个 dnode 上。在多副本情况下（replica 3），这种均衡分布尤其复杂，TDengine 的分布策略会尽量避免任意一个 dnode 成为写入的瓶颈。
+
+通过以上措施可以最大限度地在整个 TDengine 集群中实现负载均衡，负载均衡也能反过来提升系统总的数据处理能力。
+
+在初始的负载均衡建立起来之后，如果由于删库、删表等动作，特别是删库动作会导致属于它的 vnode 都被删除，这有可能会造成一定程度的负载失衡，在后续版本中会提供重新平衡的方法。但如果有新的数据库建立，TDengine 也能够一定程度自我再平衡而无须人工干预。
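
The idea of spreading tables over vgroups by hash, with every table handled by exactly one vgroup, can be sketched in Python as follows. This is an illustration under stated assumptions rather than TDengine's algorithm: `zlib.crc32` stands in for whatever hash function TDengine uses, the hash space is split into equal contiguous ranges, and the table names `d0`, `d1`, ... are made up.

```python
import bisect
import zlib
from typing import List

HASH_SPACE = 2**32  # crc32 yields values in [0, 2**32)


def vgroup_range_starts(num_vgroups: int) -> List[int]:
    """Split the hash space into num_vgroups contiguous, nearly equal ranges."""
    return [i * HASH_SPACE // num_vgroups for i in range(num_vgroups)]


def vgroup_for_table(table_name: str, range_starts: List[int]) -> int:
    """Return the index of the vgroup whose hash range contains the table."""
    h = zlib.crc32(table_name.encode("utf-8"))
    return bisect.bisect_right(range_starts, h) - 1


starts = vgroup_range_starts(100)  # create database db0 vgroups 100;
counts = [0] * 100
for i in range(10000):
    counts[vgroup_for_table(f"d{i}", starts)] += 1
print(min(counts), max(counts))  # smallest and largest number of tables per vgroup
```

Because the mapping depends only on the table name and the range boundaries, the same table always lands in the same vgroup, and the printed minimum and maximum give a quick impression of how evenly the tables are spread.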
--- a/include/common/tcommon.h
+++ b/include/common/tcommon.h
@@ -57,7 +57,7 @@ enum {
  // STREAM_INPUT__TABLE_SCAN,
  STREAM_INPUT__TQ_SCAN,
  STREAM_INPUT__DATA_RETRIEVE,
-  STREAM_INPUT__TRIGGER,
+  STREAM_INPUT__GET_RES,
  STREAM_INPUT__CHECKPOINT,
  STREAM_INPUT__DROP,
 };
@@ -155,10 +155,10 @@ typedef struct SQueryTableDataCond {
  int32_t      numOfCols;
  SColumnInfo* colList;
  int32_t      type;  // data block load type:
-//  int32_t      numOfTWindows;
-  STimeWindow  twindows;
-  int64_t      startVersion;
-  int64_t      endVersion;
+                      //  int32_t      numOfTWindows;
+  STimeWindow twindows;
+  int64_t     startVersion;
+  int64_t     endVersion;
 } SQueryTableDataCond;

 int32_t tEncodeDataBlock(void** buf, const SSDataBlock* pBlock);

--- a/include/common/tmsg.h
+++ b/include/common/tmsg.h
@@ -525,6 +525,7 @@ typedef struct {
  int8_t   superUser;
  int8_t   connType;
  SEpSet   epSet;
+  int32_t  svrTimestamp;
  char     sVer[TSDB_VERSION_LEN];
  char     sDetailVer[128];
 } SConnectRsp;
@@ -1968,7 +1969,7 @@ typedef struct SVCreateTbReq {
  int8_t   type;
  union {
    struct {
-      char*    name;    // super table name
+      char*    name;  // super table name
      tb_uid_t suid;
      SArray*  tagName;
      uint8_t* pTag;
@@ -2233,6 +2234,7 @@ typedef struct {
 typedef struct {
  int64_t reqId;
  int64_t rspId;
+  int32_t svrTimestamp;
  SArray* rsps;  // SArray<SClientHbRsp>
 } SClientHbBatchRsp;

@@ -2437,9 +2439,6 @@ typedef struct {
  int8_t igNotExists;
 } SMDropStreamReq;

-int32_t tSerializeSMDropStreamReq(void* buf, int32_t bufLen, const SMDropStreamReq* pReq);
-int32_t tDeserializeSMDropStreamReq(void* buf, int32_t bufLen, SMDropStreamReq* pReq);
-
 typedef struct {
  int8_t reserved;
 } SMDropStreamRsp;
@@ -2454,6 +2453,27 @@ typedef struct {
  int8_t reserved;
 } SVDropStreamTaskRsp;

+int32_t tSerializeSMDropStreamReq(void* buf, int32_t bufLen, const SMDropStreamReq* pReq);
+int32_t tDeserializeSMDropStreamReq(void* buf, int32_t bufLen, SMDropStreamReq* pReq);
+
+typedef struct {
+  char   name[TSDB_STREAM_FNAME_LEN];
+  int8_t igNotExists;
+} SMRecoverStreamReq;
+
+typedef struct {
+  int8_t reserved;
+} SMRecoverStreamRsp;
+
+typedef struct {
+  int64_t recoverObjUid;
+  int32_t taskId;
+  int32_t hasCheckPoint;
+} SMVStreamGatherInfoReq;
+
+int32_t tSerializeSMRecoverStreamReq(void* buf, int32_t bufLen, const SMRecoverStreamReq* pReq);
+int32_t tDeserializeSMRecoverStreamReq(void* buf, int32_t bufLen, SMRecoverStreamReq* pReq);
+
 typedef struct {
  int64_t leftForVer;
  int32_t vgId;
@@ -2876,7 +2896,8 @@ static FORCE_INLINE int32_t tEncodeSMqMetaRsp(void** buf, const SMqMetaRsp* pRsp
 }

 static FORCE_INLINE void* tDecodeSMqMetaRsp(const void* buf, SMqMetaRsp* pRsp) {
-  buf = taosDecodeFixedI64(buf, &pRsp->reqOffset);buf = taosDecodeFixedI64(buf, &pRsp->rspOffset);
+  buf = taosDecodeFixedI64(buf, &pRsp->reqOffset);
+  buf = taosDecodeFixedI64(buf, &pRsp->rspOffset);
  buf = taosDecodeFixedI16(buf, &pRsp->resMsgType);
  buf = taosDecodeFixedI32(buf, &pRsp->metaRspLen);
  buf = taosDecodeBinary(buf, &pRsp->metaRsp, pRsp->metaRspLen);

--- a/include/common/tmsgdef.h
+++ b/include/common/tmsgdef.h
@@ -131,6 +131,7 @@ enum {
  TD_DEF_MSG_TYPE(TDMT_MND_CREATE_STREAM, "create-stream", SCMCreateStreamReq, SCMCreateStreamRsp)
  TD_DEF_MSG_TYPE(TDMT_MND_ALTER_STREAM, "alter-stream", NULL, NULL)
  TD_DEF_MSG_TYPE(TDMT_MND_DROP_STREAM, "drop-stream", NULL, NULL)
+  TD_DEF_MSG_TYPE(TDMT_MND_RECOVER_STREAM, "recover-stream", NULL, NULL)
  TD_DEF_MSG_TYPE(TDMT_MND_CREATE_INDEX, "create-index", NULL, NULL)
  TD_DEF_MSG_TYPE(TDMT_MND_DROP_INDEX, "drop-index", NULL, NULL)
  TD_DEF_MSG_TYPE(TDMT_MND_GET_INDEX, "get-index", NULL, NULL)

--- a/include/libs/executor/executor.h
+++ b/include/libs/executor/executor.h
@@ -192,6 +192,8 @@ int32_t qExtractStreamScanner(qTaskInfo_t tinfo, void** scanner);

 int32_t qStreamInput(qTaskInfo_t tinfo, void* pItem);

+int32_t qStreamPrepareRecover(qTaskInfo_t tinfo, int64_t startVer, int64_t endVer);
+
 #ifdef __cplusplus
 }
 #endif

--- a/include/libs/scalar/scalar.h
+++ b/include/libs/scalar/scalar.h
@@ -103,6 +103,7 @@ int32_t minScalarFunction(SScalarParam *pInput, int32_t inputNum, SScalarParam *
 int32_t maxScalarFunction(SScalarParam *pInput, int32_t inputNum, SScalarParam *pOutput);
 int32_t avgScalarFunction(SScalarParam *pInput, int32_t inputNum, SScalarParam *pOutput);
 int32_t stddevScalarFunction(SScalarParam *pInput, int32_t inputNum, SScalarParam *pOutput);
+int32_t leastSQRScalarFunction(SScalarParam *pInput, int32_t inputNum, SScalarParam *pOutput);

 #ifdef __cplusplus
 }

--- a/include/libs/scheduler/scheduler.h
+++ b/include/libs/scheduler/scheduler.h
@@ -25,11 +25,6 @@ extern "C" {

 extern tsem_t schdRspSem;

-typedef struct SSchedulerCfg {
-  uint32_t maxJobNum;
-  int32_t  maxNodeTableNum;
-} SSchedulerCfg;
-
 typedef struct SQueryProfileSummary {
  int64_t startTs;      // Object created and added into the message queue
  int64_t endTs;        // the timestamp when the task is completed
@@ -84,7 +79,7 @@ typedef struct SSchedulerReq {
 } SSchedulerReq;


-int32_t schedulerInit(SSchedulerCfg *cfg);
+int32_t schedulerInit(void);

 int32_t schedulerExecJob(SSchedulerReq *pReq, int64_t *pJob);

@@ -96,6 +91,8 @@ int32_t schedulerGetTasksStatus(int64_t job, SArray *pSub);

 void schedulerStopQueryHb(void *pTrans);

+int32_t schedulerUpdatePolicy(int32_t policy);
+int32_t schedulerEnableReSchedule(bool enableResche);

 /**
 * Cancel query job

--- a/include/libs/stream/tstream.h
+++ b/include/libs/stream/tstream.h
@@ -34,6 +34,10 @@ typedef struct SStreamTask SStreamTask;
 enum {
  TASK_STATUS__NORMAL = 0,
  TASK_STATUS__DROPPING,
+  TASK_STATUS__FAIL,
+  TASK_STATUS__STOP,
+  TASK_STATUS__PREPARE_RECOVER,
+  TASK_STATUS__RECOVERING,
 };

 enum {
@@ -72,6 +76,7 @@ typedef struct {
  int8_t type;

  int32_t srcVgId;
+  int32_t childId;
  int64_t sourceVer;

  SArray* blocks;  // SArray<SSDataBlock*>
@@ -222,6 +227,8 @@ typedef struct {
  int32_t nodeId;
  int32_t childId;
  int32_t taskId;
+  int64_t checkpointVer;
+  int64_t processedVer;
  SEpSet  epSet;
 } SStreamChildEpInfo;

@@ -232,6 +239,7 @@ typedef struct SStreamTask {
  int8_t  execType;
  int8_t  sinkType;
  int8_t  dispatchType;
+  int8_t  isStreamDistributed;
  int16_t dispatchMsgType;

  int8_t taskStatus;
@@ -242,6 +250,13 @@ typedef struct SStreamTask {
  int32_t nodeId;
  SEpSet  epSet;

+  // used for semi or single task,
+  // while final task should have processedVer for each child
+  int64_t recoverSnapVer;
+  int64_t startVer;
+  int64_t checkpointVer;
+  int64_t processedVer;
+
  // children info
  SArray* childEpInfo;  // SArray<SStreamChildEpInfo*>

@@ -316,12 +331,12 @@ static FORCE_INLINE int32_t streamTaskInput(SStreamTask* pTask, SStreamQueueItem
  } else if (pItem->type == STREAM_INPUT__CHECKPOINT) {
    taosWriteQitem(pTask->inputQueue->queue, pItem);
    // qStreamInput(pTask->exec.executor, pItem);
-  } else if (pItem->type == STREAM_INPUT__TRIGGER) {
+  } else if (pItem->type == STREAM_INPUT__GET_RES) {
    taosWriteQitem(pTask->inputQueue->queue, pItem);
    // qStreamInput(pTask->exec.executor, pItem);
  }

-  if (pItem->type != STREAM_INPUT__TRIGGER && pItem->type != STREAM_INPUT__CHECKPOINT && pTask->triggerParam != 0) {
+  if (pItem->type != STREAM_INPUT__GET_RES && pItem->type != STREAM_INPUT__CHECKPOINT && pTask->triggerParam != 0) {
    atomic_val_compare_exchange_8(&pTask->triggerStatus, TASK_TRIGGER_STATUS__IN_ACTIVE, TASK_TRIGGER_STATUS__ACTIVE);
  }

@@ -420,6 +435,36 @@ typedef struct {
  int8_t  inputStatus;
 } SStreamTaskRecoverRsp;

+int32_t tEncodeStreamTaskRecoverReq(SEncoder* pEncoder, const SStreamTaskRecoverReq* pReq);
+int32_t tDecodeStreamTaskRecoverReq(SDecoder* pDecoder, SStreamTaskRecoverReq* pReq);
+
+int32_t tEncodeStreamTaskRecoverRsp(SEncoder* pEncoder, const SStreamTaskRecoverRsp* pRsp);
+int32_t tDecodeStreamTaskRecoverRsp(SDecoder* pDecoder, SStreamTaskRecoverRsp* pRsp);
+
+typedef struct {
+  int64_t streamId;
+  int32_t taskId;
+} SMStreamTaskRecoverReq;
+
+typedef struct {
+  int64_t streamId;
+  int32_t taskId;
+} SMStreamTaskRecoverRsp;
+
+int32_t tEncodeSMStreamTaskRecoverReq(SEncoder* pEncoder, const SMStreamTaskRecoverReq* pReq);
+int32_t tDecodeSMStreamTaskRecoverReq(SDecoder* pDecoder, SMStreamTaskRecoverReq* pReq);
+
+int32_t tEncodeSMStreamTaskRecoverRsp(SEncoder* pEncoder, const SMStreamTaskRecoverRsp* pRsp);
+int32_t tDecodeSMStreamTaskRecoverRsp(SDecoder* pDecoder, SMStreamTaskRecoverRsp* pRsp);
+
+typedef struct {
+  int64_t streamId;
+} SPStreamTaskRecoverReq;
+
+typedef struct {
+  int8_t reserved;
+} SPStreamTaskRecoverRsp;
+
 int32_t tDecodeStreamDispatchReq(SDecoder* pDecoder, SStreamDispatchReq* pReq);
 int32_t tDecodeStreamRetrieveReq(SDecoder* pDecoder, SStreamRetrieveReq* pReq);


--- a/include/util/taoserror.h
+++ b/include/util/taoserror.h
@@ -73,6 +73,7 @@ int32_t* taosGetErrno();
 #define TSDB_CODE_MSG_DECODE_ERROR              TAOS_DEF_ERROR_CODE(0, 0x0031)
 #define TSDB_CODE_NO_AVAIL_DISK                 TAOS_DEF_ERROR_CODE(0, 0x0032)
 #define TSDB_CODE_NOT_FOUND                     TAOS_DEF_ERROR_CODE(0, 0x0033)
+#define TSDB_CODE_TIME_UNSYNCED                 TAOS_DEF_ERROR_CODE(0, 0x0034)

 #define TSDB_CODE_REF_NO_MEMORY                 TAOS_DEF_ERROR_CODE(0, 0x0040)
 #define TSDB_CODE_REF_FULL                      TAOS_DEF_ERROR_CODE(0, 0x0041)
@@ -122,7 +123,7 @@ int32_t* taosGetErrno();
 #define TSDB_CODE_TSC_DUP_COL_NAMES             TAOS_DEF_ERROR_CODE(0, 0x021D)
 #define TSDB_CODE_TSC_INVALID_TAG_LENGTH        TAOS_DEF_ERROR_CODE(0, 0x021E)
 #define TSDB_CODE_TSC_INVALID_COLUMN_LENGTH     TAOS_DEF_ERROR_CODE(0, 0x021F)
-#define TSDB_CODE_TSC_DUP_TAG_NAMES             TAOS_DEF_ERROR_CODE(0, 0x0220)
+#define TSDB_CODE_TSC_DUP_NAMES                 TAOS_DEF_ERROR_CODE(0, 0x0220)
 #define TSDB_CODE_TSC_INVALID_JSON              TAOS_DEF_ERROR_CODE(0, 0x0221)
 #define TSDB_CODE_TSC_INVALID_JSON_TYPE         TAOS_DEF_ERROR_CODE(0, 0x0222)
 #define TSDB_CODE_TSC_VALUE_OUT_OF_RANGE        TAOS_DEF_ERROR_CODE(0, 0x0223)
@@ -615,6 +616,7 @@ int32_t* taosGetErrno();
 #define TSDB_CODE_SML_INVALID_PRECISION_TYPE    TAOS_DEF_ERROR_CODE(0, 0x3001)
 #define TSDB_CODE_SML_INVALID_DATA              TAOS_DEF_ERROR_CODE(0, 0x3002)
 #define TSDB_CODE_SML_INVALID_DB_CONF           TAOS_DEF_ERROR_CODE(0, 0x3003)
+#define TSDB_CODE_SML_NOT_SAME_TYPE             TAOS_DEF_ERROR_CODE(0, 0x3004)

 //tsma
 #define TSDB_CODE_TSMA_INIT_FAILED               TAOS_DEF_ERROR_CODE(0, 0x3100)

--- a/source/client/inc/clientInt.h
+++ b/source/client/inc/clientInt.h
@@ -286,7 +286,7 @@ static FORCE_INLINE SReqResultInfo* tscGetCurResInfo(TAOS_RES* res) {
 extern SAppInfo appInfo;
 extern int32_t  clientReqRefPool;
 extern int32_t  clientConnRefPool;
-extern void*    tscQhandle;
+extern int32_t  timestampDeltaLimit;

 __async_send_cb_fn_t getMsgRspHandle(int32_t msgType);


--- a/source/client/src/clientEnv.c
+++ b/source/client/src/clientEnv.c
@@ -35,6 +35,8 @@ SAppInfo appInfo;
 int32_t  clientReqRefPool = -1;
 int32_t  clientConnRefPool = -1;

+int32_t timestampDeltaLimit = 900;  // s
+
 static TdThreadOnce tscinit = PTHREAD_ONCE_INIT;
 volatile int32_t    tscInitRes = 0;

@@ -181,7 +183,7 @@ void destroyTscObj(void *pObj) {

  destroyAllRequests(pTscObj->pRequests);
  taosHashCleanup(pTscObj->pRequests);
-  
+
  schedulerStopQueryHb(pTscObj->pAppInfo->pTransporter);
  tscDebug("connObj 0x%" PRIx64 " p:%p destroyed, remain inst totalConn:%" PRId64, pTscObj->id, pTscObj,
           pTscObj->pAppInfo->numOfConns);
@@ -363,8 +365,7 @@ void taos_init_imp(void) {
  SCatalogCfg cfg = {.maxDBCacheNum = 100, .maxTblCacheNum = 100};
  catalogInit(&cfg);

-  SSchedulerCfg scfg = {.maxJobNum = 100};
-  schedulerInit(&scfg);
+  schedulerInit();
  tscDebug("starting to initialize TAOS driver");

  taosSetCoreDump(true);

--- a/source/client/src/clientHb.c
+++ b/source/client/src/clientHb.c
@@ -70,7 +70,7 @@ static int32_t hbProcessDBInfoRsp(void *value, int32_t valueLen, struct SCatalog
      if (NULL == vgInfo) {
        return TSDB_CODE_TSC_OUT_OF_MEMORY;
      }
-      
+
      vgInfo->vgVersion = rsp->vgVersion;
      vgInfo->hashMethod = rsp->hashMethod;
      vgInfo->vgHash = taosHashInit(rsp->vgNum, taosGetDefaultHashFunction(TSDB_DATA_TYPE_INT), true, HASH_ENTRY_LOCK);
@@ -156,18 +156,18 @@ static int32_t hbQueryHbRspHandle(SAppHbMgr *pAppHbMgr, SClientHbRsp *pRsp) {
    STscObj *pTscObj = (STscObj *)acquireTscObj(pRsp->connKey.tscRid);
    if (NULL == pTscObj) {
      tscDebug("tscObj rid %" PRIx64 " not exist", pRsp->connKey.tscRid);
-    } else {      
+    } else {
      if (pRsp->query->totalDnodes > 1 && !isEpsetEqual(&pTscObj->pAppInfo->mgmtEp.epSet, &pRsp->query->epSet)) {
-        SEpSet* pOrig = &pTscObj->pAppInfo->mgmtEp.epSet;
-        SEp* pOrigEp = &pOrig->eps[pOrig->inUse];
-        SEp* pNewEp = &pRsp->query->epSet.eps[pRsp->query->epSet.inUse];
-        tscDebug("mnode epset updated from %d/%d=>%s:%d to %d/%d=>%s:%d in hb", 
-            pOrig->inUse, pOrig->numOfEps, pOrigEp->fqdn, pOrigEp->port, 
-            pRsp->query->epSet.inUse, pRsp->query->epSet.numOfEps, pNewEp->fqdn, pNewEp->port);
-            
+        SEpSet *pOrig = &pTscObj->pAppInfo->mgmtEp.epSet;
+        SEp    *pOrigEp = &pOrig->eps[pOrig->inUse];
+        SEp    *pNewEp = &pRsp->query->epSet.eps[pRsp->query->epSet.inUse];
+        tscDebug("mnode epset updated from %d/%d=>%s:%d to %d/%d=>%s:%d in hb", pOrig->inUse, pOrig->numOfEps,
+                 pOrigEp->fqdn, pOrigEp->port, pRsp->query->epSet.inUse, pRsp->query->epSet.numOfEps, pNewEp->fqdn,
+                 pNewEp->port);
+
        updateEpSet_s(&pTscObj->pAppInfo->mgmtEp, &pRsp->query->epSet);
      }
-      
+
      pTscObj->pAppInfo->totalDnodes = pRsp->query->totalDnodes;
      pTscObj->pAppInfo->onlineDnodes = pRsp->query->onlineDnodes;
      pTscObj->connId = pRsp->query->connId;
@@ -263,13 +263,20 @@ static int32_t hbQueryHbRspHandle(SAppHbMgr *pAppHbMgr, SClientHbRsp *pRsp) {
 }

 static int32_t hbAsyncCallBack(void *param, SDataBuf *pMsg, int32_t code) {
-  static int32_t emptyRspNum = 0;
+  static int32_t    emptyRspNum = 0;
  char             *key = (char *)param;
  SClientHbBatchRsp pRsp = {0};
  if (TSDB_CODE_SUCCESS == code) {
    tDeserializeSClientHbBatchRsp(pMsg->pData, pMsg->len, &pRsp);
  }
-  
+
+  int32_t now = taosGetTimestampSec();
+  int32_t delta = abs(now - pRsp.svrTimestamp);
+  if (delta > timestampDeltaLimit) {
+    code = TSDB_CODE_TIME_UNSYNCED;
+    tscError("time diff: %ds is too big", delta);
+  }
+
  int32_t rspNum = taosArrayGetSize(pRsp.rsps);

  taosThreadMutexLock(&appInfo.mutex);
@@ -286,7 +293,7 @@ static int32_t hbAsyncCallBack(void *param, SDataBuf *pMsg, int32_t code) {
  taosMemoryFreeClear(param);

  if (code != 0) {
-    (*pInst)->onlineDnodes = 0;
+    (*pInst)->onlineDnodes = ((*pInst)->totalDnodes ? 0 : -1);
  }

  if (rspNum) {
@@ -373,7 +380,7 @@ int32_t hbGetQueryBasicInfo(SClientHbKey *connKey, SClientHbReq *req) {
    releaseTscObj(connKey->tscRid);
    return TSDB_CODE_QRY_OUT_OF_MEMORY;
  }
-  
+
  hbBasic->connId = pTscObj->connId;

  int32_t numOfQueries = pTscObj->pRequests ? taosHashGetSize(pTscObj->pRequests) : 0;
@@ -392,7 +399,6 @@ int32_t hbGetQueryBasicInfo(SClientHbKey *connKey, SClientHbReq *req) {
    return TSDB_CODE_QRY_OUT_OF_MEMORY;
  }

-
  int32_t code = hbBuildQueryDesc(hbBasic, pTscObj);
  if (code) {
    releaseTscObj(connKey->tscRid);
@@ -436,13 +442,12 @@ int32_t hbGetExpiredUserInfo(SClientHbKey *connKey, struct SCatalog *pCatalog, S
  if (NULL == req->info) {
    req->info = taosHashInit(64, hbKeyHashFunc, 1, HASH_ENTRY_LOCK);
  }
-  
+
  taosHashPut(req->info, &kv.key, sizeof(kv.key), &kv, sizeof(kv));

  return TSDB_CODE_SUCCESS;
 }

-
 int32_t hbGetExpiredDBInfo(SClientHbKey *connKey, struct SCatalog *pCatalog, SClientHbReq *req) {
  SDbVgVersion *dbs = NULL;
  uint32_t      dbNum = 0;
@@ -483,8 +488,8 @@ int32_t hbGetExpiredDBInfo(SClientHbKey *connKey, struct SCatalog *pCatalog, SCl

 int32_t hbGetExpiredStbInfo(SClientHbKey *connKey, struct SCatalog *pCatalog, SClientHbReq *req) {
  SSTableVersion *stbs = NULL;
-  uint32_t            stbNum = 0;
-  int32_t             code = 0;
+  uint32_t        stbNum = 0;
+  int32_t         code = 0;

  code = catalogGetExpiredSTables(pCatalog, &stbs, &stbNum);
  if (TSDB_CODE_SUCCESS != code) {
@@ -521,20 +526,19 @@ int32_t hbGetExpiredStbInfo(SClientHbKey *connKey, struct SCatalog *pCatalog, SC
 }

 int32_t hbGetAppInfo(int64_t clusterId, SClientHbReq *req) {
-  SAppHbReq* pApp = taosHashGet(clientHbMgr.appSummary, &clusterId, sizeof(clusterId));
+  SAppHbReq *pApp = taosHashGet(clientHbMgr.appSummary, &clusterId, sizeof(clusterId));
  if (NULL != pApp) {
    memcpy(&req->app, pApp, sizeof(*pApp));
  } else {
    memset(&req->app.summary, 0, sizeof(req->app.summary));
    req->app.pid = taosGetPId();
    req->app.appId = clientHbMgr.appId;
-    taosGetAppName(req->app.name, NULL);    
+    taosGetAppName(req->app.name, NULL);
  }

  return TSDB_CODE_SUCCESS;
 }

-
 int32_t hbQueryHbReqHandle(SClientHbKey *connKey, void *param, SClientHbReq *req) {
  int64_t         *clusterId = (int64_t *)param;
  struct SCatalog *pCatalog = NULL;
@@ -602,7 +606,7 @@ SClientHbBatchReq *hbGatherAllInfo(SAppHbMgr *pAppHbMgr) {
      continue;
    }

-    //hbClearClientHbReq(pOneReq);
+    // hbClearClientHbReq(pOneReq);

    pIter = taosHashIterate(pAppHbMgr->activeInfo, pIter);
  }
@@ -615,11 +619,9 @@ SClientHbBatchReq *hbGatherAllInfo(SAppHbMgr *pAppHbMgr) {
  return pBatchReq;
 }

-void hbThreadFuncUnexpectedStopped(void) {
-  atomic_store_8(&clientHbMgr.threadStop, 2);
-}
+void hbThreadFuncUnexpectedStopped(void) { atomic_store_8(&clientHbMgr.threadStop, 2); }

-void hbMergeSummary(SAppClusterSummary* dst, SAppClusterSummary* src) {
+void hbMergeSummary(SAppClusterSummary *dst, SAppClusterSummary *src) {
  dst->numOfInsertsReq += src->numOfInsertsReq;
  dst->numOfInsertRows += src->numOfInsertRows;
  dst->insertElapsedTime += src->insertElapsedTime;
@@ -633,7 +635,7 @@ void hbMergeSummary(SAppClusterSummary* dst, SAppClusterSummary* src) {

 int32_t hbGatherAppInfo(void) {
  SAppHbReq req = {0};
-  int sz = taosArrayGetSize(clientHbMgr.appHbMgrs);
+  int       sz = taosArrayGetSize(clientHbMgr.appHbMgrs);
  if (sz > 0) {
    req.pid = taosGetPId();
    req.appId = clientHbMgr.appId;
@@ -641,11 +643,11 @@ int32_t hbGatherAppInfo(void) {
  }

  taosHashClear(clientHbMgr.appSummary);
-  
+
  for (int32_t i = 0; i < sz; ++i) {
    SAppHbMgr *pAppHbMgr = taosArrayGetP(clientHbMgr.appHbMgrs, i);
-    uint64_t clusterId = pAppHbMgr->pAppInstInfo->clusterId;
-    SAppHbReq* pApp = taosHashGet(clientHbMgr.appSummary, &clusterId, sizeof(clusterId));
+    uint64_t   clusterId = pAppHbMgr->pAppInstInfo->clusterId;
+    SAppHbReq *pApp = taosHashGet(clientHbMgr.appSummary, &clusterId, sizeof(clusterId));
    if (NULL == pApp) {
      memcpy(&req.summary, &pAppHbMgr->pAppInstInfo->summary, sizeof(req.summary));
      req.startTime = pAppHbMgr->startTime;
@@ -654,7 +656,7 @@ int32_t hbGatherAppInfo(void) {
      if (pAppHbMgr->startTime < pApp->startTime) {
        pApp->startTime = pAppHbMgr->startTime;
      }
-      
+
      hbMergeSummary(&pApp->summary, &pAppHbMgr->pAppInstInfo->summary);
    }
  }
@@ -662,7 +664,6 @@ int32_t hbGatherAppInfo(void) {
  return TSDB_CODE_SUCCESS;
 }

-
 static void *hbThreadFunc(void *param) {
  setThreadName("hb");
 #ifdef WINDOWS
@@ -681,7 +682,7 @@ static void *hbThreadFunc(void *param) {
    if (sz > 0) {
      hbGatherAppInfo();
    }
-    
+
    for (int i = 0; i < sz; i++) {
      SAppHbMgr *pAppHbMgr = taosArrayGetP(clientHbMgr.appHbMgrs, i);

@@ -698,7 +699,7 @@ static void *hbThreadFunc(void *param) {
      if (buf == NULL) {
        terrno = TSDB_CODE_TSC_OUT_OF_MEMORY;
        tFreeClientHbBatchReq(pReq);
-        //hbClearReqInfo(pAppHbMgr);
+        // hbClearReqInfo(pAppHbMgr);
        break;
      }

@@ -708,7 +709,7 @@ static void *hbThreadFunc(void *param) {
      if (pInfo == NULL) {
        terrno = TSDB_CODE_TSC_OUT_OF_MEMORY;
        tFreeClientHbBatchReq(pReq);
-        //hbClearReqInfo(pAppHbMgr);
+        // hbClearReqInfo(pAppHbMgr);
        taosMemoryFree(buf);
        break;
      }
@@ -725,7 +726,7 @@ static void *hbThreadFunc(void *param) {
      SEpSet        epSet = getEpSet_s(&pAppInstInfo->mgmtEp);
      asyncSendMsgToServer(pAppInstInfo->pTransporter, &epSet, &transporterId, pInfo);
      tFreeClientHbBatchReq(pReq);
-      //hbClearReqInfo(pAppHbMgr);
+      // hbClearReqInfo(pAppHbMgr);

      atomic_add_fetch_32(&pAppHbMgr->reportCnt, 1);
    }
@@ -759,7 +760,7 @@ static void hbStopThread() {
    return;
  }

-  taosThreadJoin(clientHbMgr.thread, NULL);    
+  taosThreadJoin(clientHbMgr.thread, NULL);

  tscDebug("hb thread stopped");
 }
@@ -808,7 +809,7 @@ void hbFreeAppHbMgr(SAppHbMgr *pTarget) {
  }
  taosHashCleanup(pTarget->activeInfo);
  pTarget->activeInfo = NULL;
-  
+
  taosMemoryFree(pTarget->key);
  taosMemoryFree(pTarget);
 }
@@ -843,7 +844,7 @@ int hbMgrInit() {

  clientHbMgr.appId = tGenIdPI64();
  tscDebug("app %" PRIx64 " initialized", clientHbMgr.appId);
-  
+
  clientHbMgr.appSummary = taosHashInit(10, taosGetDefaultHashFunction(TSDB_DATA_TYPE_BIGINT), false, HASH_NO_LOCK);
  clientHbMgr.appHbMgrs = taosArrayInit(0, sizeof(void *));
  taosThreadMutexInit(&clientHbMgr.lock, NULL);
@@ -881,7 +882,7 @@ int hbRegisterConnImpl(SAppHbMgr *pAppHbMgr, SClientHbKey connKey, int64_t clust
  SClientHbReq hbReq = {0};
  hbReq.connKey = connKey;
  hbReq.clusterId = clusterId;
-  //hbReq.info = taosHashInit(64, hbKeyHashFunc, 1, HASH_ENTRY_LOCK);
+  // hbReq.info = taosHashInit(64, hbKeyHashFunc, 1, HASH_ENTRY_LOCK);

  taosHashPut(pAppHbMgr->activeInfo, &connKey, sizeof(SClientHbKey), &hbReq, sizeof(SClientHbReq));

@@ -920,4 +921,3 @@ void hbDeregisterConn(SAppHbMgr *pAppHbMgr, SClientHbKey connKey) {

  atomic_sub_fetch_32(&pAppHbMgr->connKeyCnt, 1);
 }
-
--- a/source/client/src/clientImpl.c
+++ b/source/client/src/clientImpl.c
@@ -834,6 +834,7 @@ void schedulerExecCb(SExecResult* pResult, void* param, int32_t code) {
    tscDebug("0x%" PRIx64 " client retry to handle the error, code:%d - %s, tryCount:%d, reqId:0x%" PRIx64,
             pRequest->self, code, tstrerror(code), pRequest->retry, pRequest->requestId);
    pRequest->prevCode = code;
+    schedulerFreeJob(&pRequest->body.queryJob, 0);
    doAsyncQuery(pRequest, true);
    return;
  }

--- a/source/client/src/clientMain.c
+++ b/source/client/src/clientMain.c
@@ -131,6 +131,7 @@ void taos_close(TAOS *taos) {

  STscObj *pObj = acquireTscObj(*(int64_t *)taos);
  if (NULL == pObj) {
+    taosMemoryFree(taos);
    return;
  }


--- a/source/client/src/clientMsgHandler.c
+++ b/source/client/src/clientMsgHandler.c
@@ -52,6 +52,18 @@ int32_t processConnectRsp(void* param, SDataBuf* pMsg, int32_t code) {

  SConnectRsp connectRsp = {0};
  tDeserializeSConnectRsp(pMsg->pData, pMsg->len, &connectRsp);
+
+  int32_t now = taosGetTimestampSec();
+  int32_t delta = abs(now - connectRsp.svrTimestamp);
+  if (delta > timestampDeltaLimit) {
+    code = TSDB_CODE_TIME_UNSYNCED;
+    tscError("time diff:%ds is too big", delta);
+    taosMemoryFree(pMsg->pData);
+    setErrno(pRequest, code);
+    tsem_post(&pRequest->body.rspSem);
+    return code;
+  }
+
  /*assert(connectRsp.epSet.numOfEps > 0);*/
  if (connectRsp.epSet.numOfEps == 0) {
    taosMemoryFree(pMsg->pData);
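
Note on the hunk above: the connect handler, like the heartbeat callback in clientHb.c, rejects a response when the client and server clocks differ by more than `timestampDeltaLimit` (900 s in this patch). A minimal sketch of that comparison follows. `clock_skew_exceeds` is a hypothetical helper, not a function from the patch, and it uses 64-bit seconds as an assumption rather than the types used in the client code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true when two clocks, both given in seconds,
 * differ by more than limit_sec. Assumes realistic epoch values, so the
 * subtraction below cannot overflow. */
static bool clock_skew_exceeds(int64_t client_sec, int64_t server_sec, int64_t limit_sec) {
  assert(limit_sec >= 0);
  /* Subtract the smaller value from the larger so delta is never negative. */
  int64_t delta = client_sec >= server_sec ? client_sec - server_sec
                                           : server_sec - client_sec;
  return delta > limit_sec;
}
```

With a limit of 900, a difference of exactly 900 seconds is still accepted and 901 seconds is rejected, which matches the strict `>` comparison in the patch.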

--- a/source/client/src/clientSml.c
+++ b/source/client/src/clientSml.c
@@ -274,11 +274,16 @@ static int32_t smlGenerateSchemaAction(SSchema *colField, SHashObj *colHash, SSm
  return 0;
 }

-static int32_t smlFindNearestPowerOf2(int32_t length) {
+static int32_t smlFindNearestPowerOf2(int32_t length, uint8_t type) {
  int32_t result = 1;
  while (result <= length) {
    result *= 2;
  }
+  if (type == TSDB_DATA_TYPE_BINARY && result > TSDB_MAX_BINARY_LEN - VARSTR_HEADER_SIZE){
+    result = TSDB_MAX_BINARY_LEN - VARSTR_HEADER_SIZE;
+  } else if (type == TSDB_DATA_TYPE_NCHAR && result > (TSDB_MAX_BINARY_LEN - VARSTR_HEADER_SIZE) / TSDB_NCHAR_SIZE){
+    result = (TSDB_MAX_BINARY_LEN - VARSTR_HEADER_SIZE) / TSDB_NCHAR_SIZE;
+  }
  return result;
 }

@@ -287,7 +292,7 @@ static int32_t smlBuildColumnDescription(SSmlKv *field, char *buf, int32_t bufSi
  char    tname[TSDB_TABLE_NAME_LEN] = {0};
  memcpy(tname, field->key, field->keyLen);
  if (type == TSDB_DATA_TYPE_BINARY || type == TSDB_DATA_TYPE_NCHAR) {
-    int32_t bytes = smlFindNearestPowerOf2(field->length);
+    int32_t bytes = smlFindNearestPowerOf2(field->length, type);
    int     out = snprintf(buf, bufSize, "`%s` %s(%d)", tname, tDataTypes[field->type].name, bytes);
    *outBytes = out;
  } else {
@@ -834,7 +839,7 @@ static int32_t smlParseTS(SSmlHandle *info, const char *data, int32_t len, SArra
    ASSERT(0);
  }

-  if (ts == -1) return TSDB_CODE_TSC_INVALID_TIME_STAMP;
+  if (ts == -1) return TSDB_CODE_INVALID_TIMESTAMP;

  // add ts to
  SSmlKv *kv = (SSmlKv *)taosMemoryCalloc(sizeof(SSmlKv), 1);
@@ -851,35 +856,41 @@ static int32_t smlParseTS(SSmlHandle *info, const char *data, int32_t len, SArra
  return TSDB_CODE_SUCCESS;
 }

-static bool smlParseValue(SSmlKv *pVal, SSmlMsgBuf *msg) {
+static int32_t smlParseValue(SSmlKv *pVal, SSmlMsgBuf *msg) {
  // binary
  if (smlIsBinary(pVal->value, pVal->length)) {
    pVal->type = TSDB_DATA_TYPE_BINARY;
    pVal->length -= BINARY_ADD_LEN;
+    if (pVal->length > TSDB_MAX_BINARY_LEN - VARSTR_HEADER_SIZE){
+      return TSDB_CODE_PAR_INVALID_VAR_COLUMN_LEN;
+    }
    pVal->value += (BINARY_ADD_LEN - 1);
-    return true;
+    return TSDB_CODE_SUCCESS;
  }
  // nchar
  if (smlIsNchar(pVal->value, pVal->length)) {
    pVal->type = TSDB_DATA_TYPE_NCHAR;
    pVal->length -= NCHAR_ADD_LEN;
+    if(pVal->length > (TSDB_MAX_NCHAR_LEN - VARSTR_HEADER_SIZE) / TSDB_NCHAR_SIZE){
+      return TSDB_CODE_PAR_INVALID_VAR_COLUMN_LEN;
+    }
    pVal->value += (NCHAR_ADD_LEN - 1);
-    return true;
+    return TSDB_CODE_SUCCESS;
  }

  // bool
  if (smlParseBool(pVal)) {
    pVal->type = TSDB_DATA_TYPE_BOOL;
    pVal->length = (int16_t)tDataTypes[pVal->type].bytes;
-    return true;
+    return TSDB_CODE_SUCCESS;
  }
  // number
  if (smlParseNumber(pVal, msg)) {
    pVal->length = (int16_t)tDataTypes[pVal->type].bytes;
-    return true;
+    return TSDB_CODE_SUCCESS;
  }

-  return false;
+  return TSDB_CODE_TSC_INVALID_VALUE;
 }

 static int32_t smlParseInfluxString(const char *sql, SSmlLineInfo *elements, SSmlMsgBuf *msg) {
@@ -906,7 +917,7 @@ static int32_t smlParseInfluxString(const char *sql, SSmlLineInfo *elements, SSm
  elements->measureLen = sql - elements->measure;
  if (IS_INVALID_TABLE_LEN(elements->measureLen)) {
    smlBuildInvalidDataMsg(msg, "measure is empty or too large than 192", NULL);
-    return TSDB_CODE_SML_INVALID_DATA;
+    return TSDB_CODE_TSC_INVALID_TABLE_ID_LENGTH;
  }

  // parse tag
@@ -1001,11 +1012,11 @@ static int32_t smlParseTelnetTags(const char *data, SArray *cols, char *childTab

    if (IS_INVALID_COL_LEN(keyLen)) {
      smlBuildInvalidDataMsg(msg, "invalid key or key is too long than 64", key);
-      return TSDB_CODE_SML_INVALID_DATA;
+      return TSDB_CODE_TSC_INVALID_COLUMN_LENGTH;
    }
    if (smlCheckDuplicateKey(key, keyLen, dumplicateKey)) {
      smlBuildInvalidDataMsg(msg, "dumplicate key", key);
-      return TSDB_CODE_TSC_DUP_TAG_NAMES;
+      return TSDB_CODE_TSC_DUP_NAMES;
    }

    // parse value
@@ -1026,7 +1037,7 @@ static int32_t smlParseTelnetTags(const char *data, SArray *cols, char *childTab

    if (valueLen == 0) {
      smlBuildInvalidDataMsg(msg, "invalid value", value);
-      return TSDB_CODE_SML_INVALID_DATA;
+      return TSDB_CODE_TSC_INVALID_VALUE;
    }

    // handle child table name
@@ -1059,7 +1070,7 @@ static int32_t smlParseTelnetString(SSmlHandle *info, const char *sql, SSmlTable
  smlParseTelnetElement(&sql, &tinfo->sTableName, &tinfo->sTableNameLen);
  if (!(tinfo->sTableName) || IS_INVALID_TABLE_LEN(tinfo->sTableNameLen)) {
    smlBuildInvalidDataMsg(&info->msgBuf, "invalid data", sql);
-    return TSDB_CODE_SML_INVALID_DATA;
+    return TSDB_CODE_TSC_INVALID_TABLE_ID_LENGTH;
  }

  // parse timestamp
@@ -1074,7 +1085,7 @@ static int32_t smlParseTelnetString(SSmlHandle *info, const char *sql, SSmlTable
  int32_t ret = smlParseTS(info, timestamp, tLen, cols);
  if (ret != TSDB_CODE_SUCCESS) {
    smlBuildInvalidDataMsg(&info->msgBuf, "invalid timestamp", sql);
-    return TSDB_CODE_SML_INVALID_DATA;
+    return ret;
  }

  // parse value
@@ -1083,7 +1094,7 @@ static int32_t smlParseTelnetString(SSmlHandle *info, const char *sql, SSmlTable
  smlParseTelnetElement(&sql, &value, &valueLen);
  if (!value || valueLen == 0) {
    smlBuildInvalidDataMsg(&info->msgBuf, "invalid value", sql);
-    return TSDB_CODE_SML_INVALID_DATA;
+    return TSDB_CODE_TSC_INVALID_VALUE;
  }

  SSmlKv *kv = (SSmlKv *)taosMemoryCalloc(sizeof(SSmlKv), 1);
@@ -1093,15 +1104,15 @@ static int32_t smlParseTelnetString(SSmlHandle *info, const char *sql, SSmlTable
  kv->keyLen = VALUE_LEN;
  kv->value = value;
  kv->length = valueLen;
-  if (!smlParseValue(kv, &info->msgBuf)) {
-    return TSDB_CODE_SML_INVALID_DATA;
+  if ((ret = smlParseValue(kv, &info->msgBuf)) != TSDB_CODE_SUCCESS) {
+    return ret;
  }

  // parse tags
  ret = smlParseTelnetTags(sql, tinfo->tags, tinfo->childTableName, info->dumplicateKey, &info->msgBuf);
  if (ret != TSDB_CODE_SUCCESS) {
    smlBuildInvalidDataMsg(&info->msgBuf, "invalid data", sql);
-    return TSDB_CODE_SML_INVALID_DATA;
+    return ret;
  }

  return TSDB_CODE_SUCCESS;
@@ -1135,11 +1146,11 @@ static int32_t smlParseCols(const char *data, int32_t len, SArray *cols, char *c

    if (IS_INVALID_COL_LEN(keyLen)) {
      smlBuildInvalidDataMsg(msg, "invalid key or key is too long than 64", key);
-      return TSDB_CODE_SML_INVALID_DATA;
+      return TSDB_CODE_TSC_INVALID_COLUMN_LENGTH;
    }
    if (smlCheckDuplicateKey(key, keyLen, dumplicateKey)) {
      smlBuildInvalidDataMsg(msg, "dumplicate key", key);
-      return TSDB_CODE_TSC_DUP_TAG_NAMES;
+      return TSDB_CODE_TSC_DUP_NAMES;
    }

    // parse value
@@ -1195,8 +1206,9 @@ static int32_t smlParseCols(const char *data, int32_t len, SArray *cols, char *c
    if (isTag) {
      kv->type = TSDB_DATA_TYPE_NCHAR;
    } else {
-      if (!smlParseValue(kv, msg)) {
-        return TSDB_CODE_SML_INVALID_DATA;
+      int32_t ret = smlParseValue(kv, msg);
+      if (ret != TSDB_CODE_SUCCESS) {
+        return ret;
      }
    }
  }
@@ -1204,8 +1216,8 @@ static int32_t smlParseCols(const char *data, int32_t len, SArray *cols, char *c
  return TSDB_CODE_SUCCESS;
 }

-static bool smlUpdateMeta(SHashObj *metaHash, SArray *metaArray, SArray *cols, SSmlMsgBuf *msg) {
-  for (int i = 0; i < taosArrayGetSize(cols); ++i) {  // jump timestamp
+static int32_t smlUpdateMeta(SHashObj *metaHash, SArray *metaArray, SArray *cols, SSmlMsgBuf *msg) {
+  for (int i = 0; i < taosArrayGetSize(cols); ++i) {
    SSmlKv *kv = (SSmlKv *)taosArrayGetP(cols, i);

    int16_t *index = (int16_t *)taosHashGet(metaHash, kv->key, kv->keyLen);
@@ -1213,7 +1225,7 @@ static bool smlUpdateMeta(SHashObj *metaHash, SArray *metaArray, SArray *cols, S
      SSmlKv **value = (SSmlKv **)taosArrayGet(metaArray, *index);
      if (kv->type != (*value)->type) {
        smlBuildInvalidDataMsg(msg, "the type is not the same like before", kv->key);
-        return false;
+        return TSDB_CODE_SML_NOT_SAME_TYPE;
      } else {
        if (IS_VAR_DATA_TYPE(kv->type)) {  // update string len, if bigger
          if (kv->length > (*value)->length) {
@@ -1230,7 +1242,7 @@ static bool smlUpdateMeta(SHashObj *metaHash, SArray *metaArray, SArray *cols, S
    }
  }

-  return true;
+  return TSDB_CODE_SUCCESS;
 }

 static void smlInsertMeta(SHashObj *metaHash, SArray *metaArray, SArray *cols) {
@@ -1564,10 +1576,16 @@ static int32_t smlParseTSFromJSONObj(SSmlHandle *info, cJSON *root, int64_t *tsV
  double timeDouble = value->valuedouble;
  if (smlDoubleToInt64OverFlow(timeDouble)) {
    smlBuildInvalidDataMsg(&info->msgBuf, "timestamp is too large", NULL);
-    return TSDB_CODE_TSC_INVALID_TIME_STAMP;
+    return TSDB_CODE_INVALID_TIMESTAMP;
  }
-  if (timeDouble <= 0) {
-    return TSDB_CODE_TSC_INVALID_TIME_STAMP;
+
+  if (timeDouble == 0) {
+    *tsVal = taosGetTimestampNs();
+    return TSDB_CODE_SUCCESS;
+  }
+
+  if (timeDouble < 0) {
+    return TSDB_CODE_INVALID_TIMESTAMP;
  }

  *tsVal = timeDouble;
@@ -1578,7 +1596,7 @@ static int32_t smlParseTSFromJSONObj(SSmlHandle *info, cJSON *root, int64_t *tsV
    timeDouble = timeDouble * NANOSECOND_PER_SEC;
    if (smlDoubleToInt64OverFlow(timeDouble)) {
      smlBuildInvalidDataMsg(&info->msgBuf, "timestamp is too large", NULL);
-      return TSDB_CODE_TSC_INVALID_TIME_STAMP;
+      return TSDB_CODE_INVALID_TIMESTAMP;
    }
  } else if (typeLen == 2 && (type->valuestring[1] == 's' || type->valuestring[1] == 'S')) {
    switch (type->valuestring[0]) {
@@ -1589,7 +1607,7 @@ static int32_t smlParseTSFromJSONObj(SSmlHandle *info, cJSON *root, int64_t *tsV
        timeDouble = timeDouble * NANOSECOND_PER_MSEC;
        if (smlDoubleToInt64OverFlow(timeDouble)) {
          smlBuildInvalidDataMsg(&info->msgBuf, "timestamp is too large", NULL);
-          return TSDB_CODE_TSC_INVALID_TIME_STAMP;
+          return TSDB_CODE_INVALID_TIMESTAMP;
        }
        break;
      case 'u':
@@ -1599,7 +1617,7 @@ static int32_t smlParseTSFromJSONObj(SSmlHandle *info, cJSON *root, int64_t *tsV
        timeDouble = timeDouble * NANOSECOND_PER_USEC;
        if (smlDoubleToInt64OverFlow(timeDouble)) {
          smlBuildInvalidDataMsg(&info->msgBuf, "timestamp is too large", NULL);
-          return TSDB_CODE_TSC_INVALID_TIME_STAMP;
+          return TSDB_CODE_INVALID_TIMESTAMP;
        }
        break;
      case 'n':
@@ -1634,11 +1652,11 @@ static int32_t smlParseTSFromJSON(SSmlHandle *info, cJSON *root, SArray *cols) {
    double timeDouble = timestamp->valuedouble;
    if (smlDoubleToInt64OverFlow(timeDouble)) {
      smlBuildInvalidDataMsg(&info->msgBuf, "timestamp is too large", NULL);
-      return TSDB_CODE_TSC_INVALID_TIME_STAMP;
+      return TSDB_CODE_INVALID_TIMESTAMP;
    }

    if (timeDouble < 0) {
-      return TSDB_CODE_TSC_INVALID_TIME_STAMP;
+      return TSDB_CODE_INVALID_TIMESTAMP;
    }

    uint8_t tsLen = smlGetTimestampLen((int64_t)timeDouble);
@@ -1648,19 +1666,19 @@ static int32_t smlParseTSFromJSON(SSmlHandle *info, cJSON *root, SArray *cols) {
      timeDouble = timeDouble * NANOSECOND_PER_SEC;
      if (smlDoubleToInt64OverFlow(timeDouble)) {
        smlBuildInvalidDataMsg(&info->msgBuf, "timestamp is too large", NULL);
-        return TSDB_CODE_TSC_INVALID_TIME_STAMP;
+        return TSDB_CODE_INVALID_TIMESTAMP;
      }
    } else if (tsLen == TSDB_TIME_PRECISION_MILLI_DIGITS) {
      tsVal = tsVal * NANOSECOND_PER_MSEC;
      timeDouble = timeDouble * NANOSECOND_PER_MSEC;
      if (smlDoubleToInt64OverFlow(timeDouble)) {
        smlBuildInvalidDataMsg(&info->msgBuf, "timestamp is too large", NULL);
-        return TSDB_CODE_TSC_INVALID_TIME_STAMP;
+        return TSDB_CODE_INVALID_TIMESTAMP;
      }
    } else if (timeDouble == 0) {
      tsVal = taosGetTimestampNs();
    } else {
-      return TSDB_CODE_TSC_INVALID_TIME_STAMP;
+      return TSDB_CODE_INVALID_TIMESTAMP;
    }
  } else if (cJSON_IsObject(timestamp)) {
    int32_t ret = smlParseTSFromJSONObj(info, timestamp, &tsVal);
@@ -1779,6 +1797,14 @@ static int32_t smlConvertJSONString(SSmlKv *pVal, char *typeStr, cJSON *value) {
    return TSDB_CODE_TSC_INVALID_JSON_TYPE;
  }
  pVal->length = (int16_t)strlen(value->valuestring);
+
+  if (pVal->type == TSDB_DATA_TYPE_BINARY && pVal->length > TSDB_MAX_BINARY_LEN - VARSTR_HEADER_SIZE) {
+    return TSDB_CODE_PAR_INVALID_VAR_COLUMN_LEN;
+  }
+  if (pVal->type == TSDB_DATA_TYPE_NCHAR && pVal->length > (TSDB_MAX_NCHAR_LEN - VARSTR_HEADER_SIZE) / TSDB_NCHAR_SIZE) {
+    return TSDB_CODE_PAR_INVALID_VAR_COLUMN_LEN;
+  }
+
  return smlJsonCreateSring(&pVal->value, value->valuestring, pVal->length);
 }

@@ -1913,7 +1939,7 @@ static int32_t smlParseTagsFromJSON(cJSON *root, SArray *pKVs, char *childTableN
    }
    // check duplicate keys
    if (smlCheckDuplicateKey(tag->string, keyLen, dumplicateKey)) {
-      return TSDB_CODE_TSC_DUP_TAG_NAMES;
+      return TSDB_CODE_TSC_DUP_NAMES;
    }

    // handle child table name
@@ -2033,7 +2059,7 @@ static int32_t smlParseInfluxLine(SSmlHandle *info, const char *sql) {
  }
  if (taosArrayGetSize(cols) > TSDB_MAX_COLUMNS) {
    smlBuildInvalidDataMsg(&info->msgBuf, "too many columns than 4096", NULL);
-    return TSDB_CODE_SML_INVALID_DATA;
+    return TSDB_CODE_PAR_TOO_MANY_COLUMNS;
  }

  bool            hasTable = true;
@@ -2065,7 +2091,7 @@ static int32_t smlParseInfluxLine(SSmlHandle *info, const char *sql) {

    if (taosArrayGetSize((*oneTable)->tags) > TSDB_MAX_TAGS) {
      smlBuildInvalidDataMsg(&info->msgBuf, "too many tags than 128", NULL);
-      return TSDB_CODE_SML_INVALID_DATA;
+      return TSDB_CODE_PAR_INVALID_TAGS_NUM;
    }

    (*oneTable)->sTableName = elements.measure;
@@ -2084,12 +2110,12 @@ static int32_t smlParseInfluxLine(SSmlHandle *info, const char *sql) {
  SSmlSTableMeta **tableMeta = (SSmlSTableMeta **)taosHashGet(info->superTables, elements.measure, elements.measureLen);
  if (tableMeta) {  // update meta
    ret = smlUpdateMeta((*tableMeta)->colHash, (*tableMeta)->cols, cols, &info->msgBuf);
-    if (!hasTable && ret) {
+    if (!hasTable && ret == TSDB_CODE_SUCCESS) {
      ret = smlUpdateMeta((*tableMeta)->tagHash, (*tableMeta)->tags, (*oneTable)->tags, &info->msgBuf);
    }
-    if (!ret) {
+    if (ret != TSDB_CODE_SUCCESS) {
      uError("SML:0x%" PRIx64 " smlUpdateMeta failed", info->id);
-      return TSDB_CODE_SML_INVALID_DATA;
+      return ret;
    }
  } else {
    SSmlSTableMeta *meta = smlBuildSTableMeta();
@@ -2138,7 +2164,7 @@ static int32_t smlParseTelnetLine(SSmlHandle *info, void *data) {
    smlDestroyTableInfo(info, tinfo);
    smlDestroyCols(cols);
    taosArrayDestroy(cols);
-    return TSDB_CODE_SML_INVALID_DATA;
+    return TSDB_CODE_PAR_INVALID_TAGS_NUM;
  }
  taosHashClear(info->dumplicateKey);

@@ -2169,9 +2195,9 @@ static int32_t smlParseTelnetLine(SSmlHandle *info, void *data) {
-    if (!hasTable && ret) {
+    if (!hasTable && ret == TSDB_CODE_SUCCESS) {
      ret = smlUpdateMeta((*tableMeta)->tagHash, (*tableMeta)->tags, (*oneTable)->tags, &info->msgBuf);
    }
-    if (!ret) {
+    if (ret != TSDB_CODE_SUCCESS) {
      uError("SML:0x%" PRIx64 " smlUpdateMeta failed", info->id);
-      return TSDB_CODE_SML_INVALID_DATA;
+      return ret;
    }
  } else {
    SSmlSTableMeta *meta = smlBuildSTableMeta();

--- a/source/client/test/smlTest.cpp
+++ b/source/client/test/smlTest.cpp
--- a/source/common/src/tmsg.c
+++ b/source/common/src/tmsg.c
@@ -453,6 +453,7 @@ int32_t tSerializeSClientHbBatchRsp(void *buf, int32_t bufLen, const SClientHbBa
  if (tStartEncode(&encoder) < 0) return -1;
  if (tEncodeI64(&encoder, pBatchRsp->reqId) < 0) return -1;
  if (tEncodeI64(&encoder, pBatchRsp->rspId) < 0) return -1;
+  if (tEncodeI32(&encoder, pBatchRsp->svrTimestamp) < 0) return -1;

  int32_t rspNum = taosArrayGetSize(pBatchRsp->rsps);
  if (tEncodeI32(&encoder, rspNum) < 0) return -1;
@@ -474,6 +475,7 @@ int32_t tDeserializeSClientHbBatchRsp(void *buf, int32_t bufLen, SClientHbBatchR
  if (tStartDecode(&decoder) < 0) return -1;
  if (tDecodeI64(&decoder, &pBatchRsp->reqId) < 0) return -1;
  if (tDecodeI64(&decoder, &pBatchRsp->rspId) < 0) return -1;
+  if (tDecodeI32(&decoder, &pBatchRsp->svrTimestamp) < 0) return -1;

  int32_t rspNum = 0;
  if (tDecodeI32(&decoder, &rspNum) < 0) return -1;
@@ -3613,6 +3615,7 @@ int32_t tSerializeSConnectRsp(void *buf, int32_t bufLen, SConnectRsp *pRsp) {
  if (tEncodeI8(&encoder, pRsp->superUser) < 0) return -1;
  if (tEncodeI8(&encoder, pRsp->connType) < 0) return -1;
  if (tEncodeSEpSet(&encoder, &pRsp->epSet) < 0) return -1;
+  if (tEncodeI32(&encoder, pRsp->svrTimestamp) < 0) return -1;
  if (tEncodeCStr(&encoder, pRsp->sVer) < 0) return -1;
  if (tEncodeCStr(&encoder, pRsp->sDetailVer) < 0) return -1;
  tEndEncode(&encoder);
@@ -3634,6 +3637,7 @@ int32_t tDeserializeSConnectRsp(void *buf, int32_t bufLen, SConnectRsp *pRsp) {
  if (tDecodeI8(&decoder, &pRsp->superUser) < 0) return -1;
  if (tDecodeI8(&decoder, &pRsp->connType) < 0) return -1;
  if (tDecodeSEpSet(&decoder, &pRsp->epSet) < 0) return -1;
+  if (tDecodeI32(&decoder, &pRsp->svrTimestamp) < 0) return -1;
  if (tDecodeCStrTo(&decoder, pRsp->sVer) < 0) return -1;
  if (tDecodeCStrTo(&decoder, pRsp->sDetailVer) < 0) return -1;
  tEndDecode(&decoder);
@@ -4823,6 +4827,35 @@ int32_t tDeserializeSMDropStreamReq(void *buf, int32_t bufLen, SMDropStreamReq *
  return 0;
 }

+int32_t tSerializeSMRecoverStreamReq(void *buf, int32_t bufLen, const SMRecoverStreamReq *pReq) {
+  SEncoder encoder = {0};
+  tEncoderInit(&encoder, buf, bufLen);
+
+  if (tStartEncode(&encoder) < 0) return -1;
+  if (tEncodeCStr(&encoder, pReq->name) < 0) return -1;
+  if (tEncodeI8(&encoder, pReq->igNotExists) < 0) return -1;
+
+  tEndEncode(&encoder);
+
+  int32_t tlen = encoder.pos;
+  tEncoderClear(&encoder);
+  return tlen;
+}
+
+int32_t tDeserializeSMRecoverStreamReq(void *buf, int32_t bufLen, SMRecoverStreamReq *pReq) {
+  SDecoder decoder = {0};
+  tDecoderInit(&decoder, buf, bufLen);
+
+  if (tStartDecode(&decoder) < 0) return -1;
+  if (tDecodeCStrTo(&decoder, pReq->name) < 0) return -1;
+  if (tDecodeI8(&decoder, &pReq->igNotExists) < 0) return -1;
+
+  tEndDecode(&decoder);
+
+  tDecoderClear(&decoder);
+  return 0;
+}
+
 void tFreeSCMCreateStreamReq(SCMCreateStreamReq *pReq) {
  taosMemoryFreeClear(pReq->sql);
  taosMemoryFreeClear(pReq->ast);
@@ -4945,8 +4978,8 @@ int tEncodeSVCreateTbReq(SEncoder *pCoder, const SVCreateTbReq *pReq) {
    if (tEncodeTag(pCoder, (const STag *)pReq->ctb.pTag) < 0) return -1;
    int32_t len = taosArrayGetSize(pReq->ctb.tagName);
    if (tEncodeI32(pCoder, len) < 0) return -1;
-    for (int32_t i = 0; i < len; i++){
-      char* name = taosArrayGet(pReq->ctb.tagName, i);
+    for (int32_t i = 0; i < len; i++) {
+      char *name = taosArrayGet(pReq->ctb.tagName, i);
      if (tEncodeCStr(pCoder, name) < 0) return -1;
    }
  } else if (pReq->type == TSDB_NORMAL_TABLE) {
@@ -4982,9 +5015,9 @@ int tDecodeSVCreateTbReq(SDecoder *pCoder, SVCreateTbReq *pReq) {
    int32_t len = 0;
    if (tDecodeI32(pCoder, &len) < 0) return -1;
    pReq->ctb.tagName = taosArrayInit(len, TSDB_COL_NAME_LEN);
-    if(pReq->ctb.tagName == NULL) return -1;
-    for (int32_t i = 0; i < len; i++){
-      char name[TSDB_COL_NAME_LEN] = {0};
+    if (pReq->ctb.tagName == NULL) return -1;
+    for (int32_t i = 0; i < len; i++) {
+      char  name[TSDB_COL_NAME_LEN] = {0};
      char *tmp = NULL;
      if (tDecodeCStr(pCoder, &tmp) < 0) return -1;
      strcpy(name, tmp);

--- a/source/dnode/mgmt/mgmt_mnode/src/mmInt.c
+++ b/source/dnode/mgmt/mgmt_mnode/src/mmInt.c
@@ -148,9 +148,9 @@ static int32_t mmStart(SMnodeMgmt *pMgmt) {

 static void mmStop(SMnodeMgmt *pMgmt) {
  dDebug("mnode-mgmt start to stop");
+  mndPreClose(pMgmt->pMnode);
  taosThreadRwlockWrlock(&pMgmt->lock);
  pMgmt->stopped = 1;
-  mndPreClose(pMgmt->pMnode);
  taosThreadRwlockUnlock(&pMgmt->lock);

  mndStop(pMgmt->pMnode);

--- a/source/dnode/mgmt/node_mgmt/src/dmTransport.c
+++ b/source/dnode/mgmt/node_mgmt/src/dmTransport.c
@@ -221,11 +221,11 @@ int32_t dmInitMsgHandle(SDnode *pDnode) {

 static inline int32_t dmSendReq(const SEpSet *pEpSet, SRpcMsg *pMsg) {
  SDnode *pDnode = dmInstance();
-  if (pDnode->status != DND_STAT_RUNNING) {
+  if (pDnode->status != DND_STAT_RUNNING && pMsg->msgType < TDMT_SYNC_MSG) {
    rpcFreeCont(pMsg->pCont);
    pMsg->pCont = NULL;
    terrno = TSDB_CODE_NODE_OFFLINE;
-    dError("failed to send rpc msg since %s, handle:%p", terrstr(), pMsg->info.handle);
+    dError("failed to send rpc msg:%s since %s, handle:%p", TMSG_INFO(pMsg->msgType), terrstr(), pMsg->info.handle);
    return -1;
  } else {
    rpcSendRequest(pDnode->trans.clientRpc, pEpSet, pMsg, NULL);

--- a/source/dnode/mnode/impl/inc/mndDef.h
+++ b/source/dnode/mnode/impl/inc/mndDef.h
@@ -559,6 +559,7 @@ typedef struct {
  // info
  int64_t uid;
  int8_t  status;
+  int8_t  isDistributed;
  // config
  int8_t  igExpired;
  int8_t  trigger;
@@ -586,6 +587,23 @@ typedef struct {
 int32_t tEncodeSStreamObj(SEncoder* pEncoder, const SStreamObj* pObj);
 int32_t tDecodeSStreamObj(SDecoder* pDecoder, SStreamObj* pObj);

+typedef struct {
+  char    streamName[TSDB_STREAM_FNAME_LEN];
+  int64_t uid;
+  int64_t streamUid;
+  SArray* childInfo;  // SArray<SStreamChildEpInfo>
+} SStreamCheckpointObj;
+
+#if 0
+typedef struct {
+  int64_t uid;
+  int64_t streamId;
+  int8_t  isDistributed;
+  int8_t  status;
+  int8_t  stage;
+} SStreamRecoverObj;
+#endif
+
 #ifdef __cplusplus
 }
 #endif

--- a/source/dnode/mnode/impl/inc/mndInt.h
+++ b/source/dnode/mnode/impl/inc/mndInt.h
@@ -50,8 +50,8 @@ extern "C" {
 // clang-format on

 #define SYSTABLE_SCH_TABLE_NAME_LEN ((TSDB_TABLE_NAME_LEN - 1) + VARSTR_HEADER_SIZE)
-#define SYSTABLE_SCH_DB_NAME_LEN    ((TSDB_DB_NAME_LEN - 1) + VARSTR_HEADER_SIZE)
-#define SYSTABLE_SCH_COL_NAME_LEN   ((TSDB_COL_NAME_LEN - 1) + VARSTR_HEADER_SIZE)
+#define SYSTABLE_SCH_DB_NAME_LEN ((TSDB_DB_NAME_LEN - 1) + VARSTR_HEADER_SIZE)
+#define SYSTABLE_SCH_COL_NAME_LEN ((TSDB_COL_NAME_LEN - 1) + VARSTR_HEADER_SIZE)

 typedef int32_t (*MndMsgFp)(SRpcMsg *pMsg);
 typedef int32_t (*MndInitFp)(SMnode *pMnode);
@@ -61,7 +61,7 @@ typedef void (*ShowFreeIterFp)(SMnode *pMnode, void *pIter);
 typedef struct SQWorker SQHandle;

 typedef struct {
-  const char * name;
+  const char  *name;
  MndInitFp    initFp;
  MndCleanupFp cleanupFp;
 } SMnodeStep;
@@ -70,7 +70,7 @@ typedef struct {
  int64_t        showId;
  ShowRetrieveFp retrieveFps[TSDB_MGMT_TABLE_MAX];
  ShowFreeIterFp freeIterFps[TSDB_MGMT_TABLE_MAX];
-  SCacheObj *    cache;
+  SCacheObj     *cache;
 } SShowMgmt;

 typedef struct {
@@ -84,12 +84,13 @@ typedef struct {
 } STelemMgmt;

 typedef struct {
-  tsem_t    syncSem;
+  tsem_t   syncSem;
  int64_t  sync;
  bool     standby;
  SReplica replica;
  int32_t  errCode;
  int32_t  transId;
+  int8_t   leaderTransferFinish;
 } SSyncMgmt;

 typedef struct {
@@ -107,14 +108,14 @@ typedef struct SMnode {
  bool           stopped;
  bool           restored;
  bool           deploy;
-  char *         path;
+  char          *path;
  int64_t        checkTime;
-  SSdb *         pSdb;
-  SArray *       pSteps;
-  SQHandle *     pQuery;
-  SHashObj *     infosMeta;
-  SHashObj *     perfsMeta;
-  SWal *         pWal;
+  SSdb          *pSdb;
+  SArray        *pSteps;
+  SQHandle      *pQuery;
+  SHashObj      *infosMeta;
+  SHashObj      *perfsMeta;
+  SWal          *pWal;
  SShowMgmt      showMgmt;
  SProfileMgmt   profileMgmt;
  STelemMgmt     telemMgmt;

--- a/source/dnode/mnode/impl/inc/mndTrans.h
+++ b/source/dnode/mnode/impl/inc/mndTrans.h
@@ -27,6 +27,7 @@ typedef enum {
  TRANS_STOP_FUNC_TEST = 2,
  TRANS_START_FUNC_MQ_REB = 3,
  TRANS_STOP_FUNC_MQ_REB = 4,
+  TRANS_FUNC_RECOVER_STREAM_STEP_NEXT = 5,
 } ETrnFunc;

 typedef enum {

--- a/source/dnode/mnode/impl/src/mndDef.c
+++ b/source/dnode/mnode/impl/src/mndDef.c
@@ -27,6 +27,7 @@ int32_t tEncodeSStreamObj(SEncoder *pEncoder, const SStreamObj *pObj) {

  if (tEncodeI64(pEncoder, pObj->uid) < 0) return -1;
  if (tEncodeI8(pEncoder, pObj->status) < 0) return -1;
+  if (tEncodeI8(pEncoder, pObj->isDistributed) < 0) return -1;

  if (tEncodeI8(pEncoder, pObj->igExpired) < 0) return -1;
  if (tEncodeI8(pEncoder, pObj->trigger) < 0) return -1;
@@ -72,6 +73,7 @@ int32_t tDecodeSStreamObj(SDecoder *pDecoder, SStreamObj *pObj) {

  if (tDecodeI64(pDecoder, &pObj->uid) < 0) return -1;
  if (tDecodeI8(pDecoder, &pObj->status) < 0) return -1;
+  if (tDecodeI8(pDecoder, &pObj->isDistributed) < 0) return -1;

  if (tDecodeI8(pDecoder, &pObj->igExpired) < 0) return -1;
  if (tDecodeI8(pDecoder, &pObj->trigger) < 0) return -1;

--- a/source/dnode/mnode/impl/src/mndMain.c
+++ b/source/dnode/mnode/impl/src/mndMain.c
@@ -368,7 +368,18 @@ SMnode *mndOpen(const char *path, const SMnodeOpt *pOption) {

 void mndPreClose(SMnode *pMnode) {
  if (pMnode != NULL) {
+    atomic_store_8(&(pMnode->syncMgmt.leaderTransferFinish), 0);
    syncLeaderTransfer(pMnode->syncMgmt.sync);
+
+    /*
+        mDebug("vgId:1, mnode start leader transfer");
+        // wait for leader transfer finish
+        while (!atomic_load_8(&(pMnode->syncMgmt.leaderTransferFinish))) {
+          taosMsleep(10);
+          mDebug("vgId:1, mnode waiting for leader transfer");
+        }
+        mDebug("vgId:1, mnode finish leader transfer");
+    */
  }
 }


--- a/source/dnode/mnode/impl/src/mndMnode.c
+++ b/source/dnode/mnode/impl/src/mndMnode.c
@@ -218,7 +218,6 @@ bool mndIsMnode(SMnode *pMnode, int32_t dnodeId) {
 }

 void mndGetMnodeEpSet(SMnode *pMnode, SEpSet *pEpSet) {
-#if 0  
  SSdb   *pSdb = pMnode->pSdb;
  int32_t totalMnodes = sdbGetSize(pSdb, SDB_MNODE);
  void   *pIter = NULL;
@@ -238,9 +237,10 @@ void mndGetMnodeEpSet(SMnode *pMnode, SEpSet *pEpSet) {
    addEpIntoEpSet(pEpSet, pObj->pDnode->fqdn, pObj->pDnode->port);
    sdbRelease(pSdb, pObj);
  }
-#else
-  syncGetRetryEpSet(pMnode->syncMgmt.sync, pEpSet);
-#endif
+
+  if (pEpSet->numOfEps == 0) {
+    syncGetRetryEpSet(pMnode->syncMgmt.sync, pEpSet);
+  }
 }

 static int32_t mndSetCreateMnodeRedoLogs(SMnode *pMnode, STrans *pTrans, SMnodeObj *pObj) {

--- a/source/dnode/mnode/impl/src/mndProfile.c
+++ b/source/dnode/mnode/impl/src/mndProfile.c
@@ -15,10 +15,10 @@

 #define _DEFAULT_SOURCE
 #include "mndProfile.h"
-#include "mndPrivilege.h"
 #include "mndDb.h"
 #include "mndDnode.h"
 #include "mndMnode.h"
+#include "mndPrivilege.h"
 #include "mndQnode.h"
 #include "mndShow.h"
 #include "mndStb.h"
@@ -274,6 +274,7 @@ static int32_t mndProcessConnectReq(SRpcMsg *pReq) {
  connectRsp.connId = pConn->id;
  connectRsp.connType = connReq.connType;
  connectRsp.dnodeNum = mndGetDnodeSize(pMnode);
+  connectRsp.svrTimestamp = taosGetTimestampSec();

  strcpy(connectRsp.sVer, version);
  snprintf(connectRsp.sDetailVer, sizeof(connectRsp.sDetailVer), "ver:%s\nbuild:%s\ngitinfo:%s", version, buildinfo,
@@ -623,6 +624,7 @@ static int32_t mndProcessHeartBeatReq(SRpcMsg *pReq) {
  }

  SClientHbBatchRsp batchRsp = {0};
+  batchRsp.svrTimestamp = taosGetTimestampSec();
  batchRsp.rsps = taosArrayInit(0, sizeof(SClientHbRsp));

  int32_t sz = taosArrayGetSize(batchReq.reqs);

--- a/source/dnode/mnode/impl/src/mndScheduler.c
+++ b/source/dnode/mnode/impl/src/mndScheduler.c
@@ -319,6 +319,7 @@ int32_t mndScheduleStream(SMnode* pMnode, SStreamObj* pStream) {
  int32_t totLevel = LIST_LENGTH(pPlan->pSubplans);
  ASSERT(totLevel <= 2);
  pStream->tasks = taosArrayInit(totLevel, sizeof(void*));
+  pStream->isDistributed = totLevel == 2;

  bool    hasExtraSink = false;
  bool    externalTargetDB = strcmp(pStream->sourceDb, pStream->targetDb) != 0;

--- a/source/dnode/mnode/impl/src/mndStream.c
+++ b/source/dnode/mnode/impl/src/mndStream.c
@@ -36,7 +36,7 @@ static int32_t mndStreamActionDelete(SSdb *pSdb, SStreamObj *pStream);
 static int32_t mndStreamActionUpdate(SSdb *pSdb, SStreamObj *pStream, SStreamObj *pNewStream);
 static int32_t mndProcessCreateStreamReq(SRpcMsg *pReq);
 static int32_t mndProcessDropStreamReq(SRpcMsg *pReq);
-/*static int32_t mndProcessDropStreamInRsp(SRpcMsg *pRsp);*/
+static int32_t mndProcessRecoverStreamReq(SRpcMsg *pReq);
 static int32_t mndProcessStreamMetaReq(SRpcMsg *pReq);
 static int32_t mndGetStreamMeta(SRpcMsg *pReq, SShowObj *pShow, STableMetaRsp *pMeta);
 static int32_t mndRetrieveStream(SRpcMsg *pReq, SShowObj *pShow, SSDataBlock *pBlock, int32_t rows);
@@ -55,6 +55,7 @@ int32_t mndInitStream(SMnode *pMnode) {

  mndSetMsgHandle(pMnode, TDMT_MND_CREATE_STREAM, mndProcessCreateStreamReq);
  mndSetMsgHandle(pMnode, TDMT_MND_DROP_STREAM, mndProcessDropStreamReq);
+  mndSetMsgHandle(pMnode, TDMT_MND_RECOVER_STREAM, mndProcessRecoverStreamReq);

  mndSetMsgHandle(pMnode, TDMT_STREAM_TASK_DEPLOY_RSP, mndTransProcessRsp);
  mndSetMsgHandle(pMnode, TDMT_STREAM_TASK_DROP_RSP, mndTransProcessRsp);
@@ -672,6 +673,69 @@ static int32_t mndProcessDropStreamReq(SRpcMsg *pReq) {
  return TSDB_CODE_ACTION_IN_PROGRESS;
 }

+static int32_t mndProcessRecoverStreamReq(SRpcMsg *pReq) {
+  SMnode     *pMnode = pReq->info.node;
+  SStreamObj *pStream = NULL;
+  /*SDbObj     *pDb = NULL;*/
+  /*SUserObj   *pUser = NULL;*/
+
+  SMRecoverStreamReq recoverReq = {0};
+  if (tDeserializeSMRecoverStreamReq(pReq->pCont, pReq->contLen, &recoverReq) < 0) {
+    ASSERT(0);
+    terrno = TSDB_CODE_INVALID_MSG;
+    return -1;
+  }
+
+  pStream = mndAcquireStream(pMnode, recoverReq.name);
+
+  if (pStream == NULL) {
+    if (recoverReq.igNotExists) {
+      mDebug("stream:%s, not exist, ignore not exist is set", recoverReq.name);
+      sdbRelease(pMnode->pSdb, pStream);
+      return 0;
+    } else {
+      terrno = TSDB_CODE_MND_STREAM_NOT_EXIST;
+      return -1;
+    }
+  }
+
+  if (mndCheckDbPrivilegeByName(pMnode, pReq->info.conn.user, MND_OPER_WRITE_DB, pStream->targetDb) != 0) {
+    return -1;
+  }
+
+  STrans *pTrans = mndTransCreate(pMnode, TRN_POLICY_RETRY, TRN_CONFLICT_NOTHING, pReq);
+  if (pTrans == NULL) {
+    mError("stream:%s, failed to recover since %s", recoverReq.name, terrstr());
+    sdbRelease(pMnode->pSdb, pStream);
+    return -1;
+  }
+  mDebug("trans:%d, used to recover stream:%s", pTrans->id, recoverReq.name);
+
+  // broadcast to recover all tasks
+  if (mndDropStreamTasks(pMnode, pTrans, pStream) < 0) {
+    mError("stream:%s, failed to recover task since %s", recoverReq.name, terrstr());
+    sdbRelease(pMnode->pSdb, pStream);
+    return -1;
+  }
+
+  // update stream status
+  if (mndPersistDropStreamLog(pMnode, pTrans, pStream) < 0) {
+    sdbRelease(pMnode->pSdb, pStream);
+    return -1;
+  }
+
+  if (mndTransPrepare(pMnode, pTrans) != 0) {
+    mError("trans:%d, failed to prepare recover stream trans since %s", pTrans->id, terrstr());
+    sdbRelease(pMnode->pSdb, pStream);
+    mndTransDrop(pTrans);
+    return -1;
+  }
+
+  sdbRelease(pMnode->pSdb, pStream);
+
+  return TSDB_CODE_ACTION_IN_PROGRESS;
+}
+
 int32_t mndDropStreamByDb(SMnode *pMnode, STrans *pTrans, SDbObj *pDb) {
  SSdb *pSdb = pMnode->pSdb;
  void *pIter = NULL;

--- a/source/dnode/mnode/impl/src/mndSync.c
+++ b/source/dnode/mnode/impl/src/mndSync.c
@@ -56,23 +56,24 @@ void mndSyncCommitMsg(struct SSyncFSM *pFsm, const SRpcMsg *pMsg, SFsmCbMeta cbM
    sdbSetApplyInfo(pMnode->pSdb, cbMeta.index, cbMeta.term, cbMeta.lastConfigIndex);
  }

-  if (pMgmt->transId == transId && transId != 0) {
+  if (transId <= 0) {
+    mError("trans:%d, invalid commit msg", transId);
+  } else if (transId == pMgmt->transId) {
    if (pMgmt->errCode != 0) {
      mError("trans:%d, failed to propose since %s", transId, tstrerror(pMgmt->errCode));
    }
    pMgmt->transId = 0;
    tsem_post(&pMgmt->syncSem);
  } else {
-#if 1
-    mError("trans:%d, invalid commit msg since trandId not match with %d", transId, pMgmt->transId);
-#else
    STrans *pTrans = mndAcquireTrans(pMnode, transId);
    if (pTrans != NULL) {
+      mDebug("trans:%d, execute in mnode which is not leader", transId);
      mndTransExecute(pMnode, pTrans);
      mndReleaseTrans(pMnode, pTrans);
+      // sdbWriteFile(pMnode->pSdb, SDB_WRITE_DELTA);
+    } else {
+      mError("trans:%d, not found while executing in mnode since %s", transId, terrstr());
    }
-    // sdbWriteFile(pMnode->pSdb, SDB_WRITE_DELTA);
-#endif
  }
 }

@@ -153,6 +154,12 @@ int32_t mndSnapshotDoWrite(struct SSyncFSM *pFsm, void *pWriter, void *pBuf, int
  return sdbDoWrite(pMnode->pSdb, pWriter, pBuf, len);
 }

+void mndLeaderTransfer(struct SSyncFSM *pFsm, const SRpcMsg *pMsg, SFsmCbMeta cbMeta) {
+  SMnode *pMnode = pFsm->data;
+  atomic_store_8(&(pMnode->syncMgmt.leaderTransferFinish), 1);
+  mDebug("vgId:1, mnd leader transfer finish");
+}
+
 SSyncFSM *mndSyncMakeFsm(SMnode *pMnode) {
  SSyncFSM *pFsm = taosMemoryCalloc(1, sizeof(SSyncFSM));
  pFsm->data = pMnode;
@@ -160,6 +167,7 @@ SSyncFSM *mndSyncMakeFsm(SMnode *pMnode) {
  pFsm->FpPreCommitCb = NULL;
  pFsm->FpRollBackCb = NULL;
  pFsm->FpRestoreFinishCb = mndRestoreFinish;
+  pFsm->FpLeaderTransferCb = mndLeaderTransfer;
  pFsm->FpReConfigCb = mndReConfig;
  pFsm->FpGetSnapshot = mndSyncGetSnapshot;
  pFsm->FpGetSnapshotInfo = mndSyncGetSnapshotInfo;

--- a/source/dnode/mnode/sdb/inc/sdb.h
+++ b/source/dnode/mnode/sdb/inc/sdb.h
@@ -137,17 +137,18 @@ typedef enum {
  SDB_USER = 7,
  SDB_AUTH = 8,
  SDB_ACCT = 9,
-  SDB_STREAM = 10,
-  SDB_OFFSET = 11,
-  SDB_SUBSCRIBE = 12,
-  SDB_CONSUMER = 13,
-  SDB_TOPIC = 14,
-  SDB_VGROUP = 15,
-  SDB_SMA = 16,
-  SDB_STB = 17,
-  SDB_DB = 18,
-  SDB_FUNC = 19,
-  SDB_MAX = 20
+  SDB_STREAM_CK = 10,
+  SDB_STREAM = 11,
+  SDB_OFFSET = 12,
+  SDB_SUBSCRIBE = 13,
+  SDB_CONSUMER = 14,
+  SDB_TOPIC = 15,
+  SDB_VGROUP = 16,
+  SDB_SMA = 17,
+  SDB_STB = 18,
+  SDB_DB = 19,
+  SDB_FUNC = 20,
+  SDB_MAX = 21
 } ESdbType;

 typedef struct SSdbRaw {
@@ -308,7 +309,7 @@ void sdbRelease(SSdb *pSdb, void *pObj);
 * @return void* The next iterator of the table.
 */
 void *sdbFetch(SSdb *pSdb, ESdbType type, void *pIter, void **ppObj);
-void *sdbFetchAll(SSdb *pSdb, ESdbType type, void *pIter, void **ppObj, ESdbStatus *status) ;
+void *sdbFetchAll(SSdb *pSdb, ESdbType type, void *pIter, void **ppObj, ESdbStatus *status);

 /**
 * @brief Cancel a traversal

--- a/source/dnode/vnode/src/tsdb/tsdbCacheRead.c
+++ b/source/dnode/vnode/src/tsdb/tsdbCacheRead.c
@@ -177,7 +177,6 @@ int32_t tsdbRetrieveLastRow(void* pReader, SSDataBlock* pResBlock, const int32_t
      saveOneRow(pRow, pResBlock, pr, slotIds);
      taosArrayPush(pTableUidList, &pKeyInfo->uid);

-      // taosMemoryFree(pRow);
      tsdbCacheRelease(lruCache, h);

      pr->tableIndex += 1;

--- a/source/dnode/vnode/src/tsdb/tsdbRead.c
+++ b/source/dnode/vnode/src/tsdb/tsdbRead.c
@@ -830,9 +830,8 @@ static int32_t doLoadFileBlockData(STsdbReader* pReader, SDataBlockIter* pBlockI
  SBlockLoadSuppInfo* pSupInfo = &pReader->suppInfo;
  SFileBlockDumpInfo* pDumpInfo = &pReader->status.fBlockDumpInfo;

-  uint8_t *pb = NULL, *pb1 = NULL;
  int32_t  code = tsdbReadColData(pReader->pFileReader, &pBlockScanInfo->blockIdx, pBlock, pSupInfo->colIds, numOfCols,
-                                  pBlockData, &pb, &pb1);
+                                  pBlockData, NULL, NULL);
  if (code != TSDB_CODE_SUCCESS) {
    goto _error;
  }
@@ -3007,11 +3006,14 @@ SArray* tsdbRetrieveDataBlock(STsdbReader* pReader, SArray* pIdList) {

  code = doLoadFileBlockData(pReader, &pStatus->blockIter, pBlockScanInfo, &pStatus->fileBlockData);
  if (code != TSDB_CODE_SUCCESS) {
+    tBlockDataClear(&pStatus->fileBlockData);
+
    terrno = code;
    return NULL;
  }

  copyBlockDataToSDataBlock(pReader, pBlockScanInfo);
+  tBlockDataClear(&pStatus->fileBlockData);
  return pReader->pResBlock->pDataBlock;
 }


--- a/source/dnode/vnode/src/vnd/vnodeSync.c
+++ b/source/dnode/vnode/src/vnd/vnodeSync.c
@@ -536,6 +536,10 @@ static int32_t vnodeSnapshotDoWrite(struct SSyncFSM *pFsm, void *pWriter, void *
 #endif
 }

+static void vnodeLeaderTransfer(struct SSyncFSM *pFsm, const SRpcMsg *pMsg, SFsmCbMeta cbMeta) {
+  SVnode *pVnode = pFsm->data;
+}
+
 static SSyncFSM *vnodeSyncMakeFsm(SVnode *pVnode) {
  SSyncFSM *pFsm = taosMemoryCalloc(1, sizeof(SSyncFSM));
  pFsm->data = pVnode;
@@ -544,6 +548,7 @@ static SSyncFSM *vnodeSyncMakeFsm(SVnode *pVnode) {
  pFsm->FpRollBackCb = vnodeSyncRollBackMsg;
  pFsm->FpGetSnapshotInfo = vnodeSyncGetSnapshot;
  pFsm->FpRestoreFinishCb = NULL;
+  pFsm->FpLeaderTransferCb = vnodeLeaderTransfer;
  pFsm->FpReConfigCb = vnodeSyncReconfig;
  pFsm->FpSnapshotStartRead = vnodeSnapshotStartRead;
  pFsm->FpSnapshotStopRead = vnodeSnapshotStopRead;
@@ -579,8 +584,8 @@ int32_t vnodeSyncOpen(SVnode *pVnode, char *path) {
  }

  setPingTimerMS(pVnode->sync, 5000);
-  setElectTimerMS(pVnode->sync, 500);
-  setHeartbeatTimerMS(pVnode->sync, 100);
+  setElectTimerMS(pVnode->sync, 1300);
+  setHeartbeatTimerMS(pVnode->sync, 900);
  return 0;
 }


--- a/source/libs/command/CMakeLists.txt
+++ b/source/libs/command/CMakeLists.txt
@@ -8,9 +8,9 @@ target_include_directories(

 target_link_libraries(
        command
-        PRIVATE os util nodes catalog function transport qcom
+        PRIVATE os util nodes catalog function transport qcom scheduler
 )

 if(${BUILD_TEST})
    ADD_SUBDIRECTORY(test)
-endif(${BUILD_TEST})
\ No newline at end of file
+endif(${BUILD_TEST})
--- a/source/libs/command/inc/commandInt.h
+++ b/source/libs/command/inc/commandInt.h
@@ -77,6 +77,10 @@ extern "C" {
 #define EXPLAIN_MODE_FORMAT "mode=%s"
 #define EXPLAIN_STRING_TYPE_FORMAT "%s"

+#define COMMAND_RESET_LOG "resetLog"
+#define COMMAND_SCHEDULE_POLICY "schedulePolicy"
+#define COMMAND_ENABLE_RESCHEDULE "enableReSchedule"
+
 typedef struct SExplainGroup {
  int32_t   nodeNum;
  int32_t   physiPlanExecNum;

--- a/source/libs/command/src/command.c
+++ b/source/libs/command/src/command.c
@@ -17,6 +17,8 @@
 #include "catalog.h"
 #include "tdatablock.h"
 #include "tglobal.h"
+#include "commandInt.h"
+#include "scheduler.h"

 extern SConfig* tsCfg;

@@ -479,7 +481,42 @@ static int32_t execShowCreateSTable(SShowCreateTableStmt* pStmt, SRetrieveTableR
  return execShowCreateTable(pStmt, pRsp);
 }

+static int32_t execAlterCmd(char* cmd, char* value, bool* processed) {
+  int32_t code = 0;
+  
+  if (0 == strcasecmp(cmd, COMMAND_RESET_LOG)) {
+    taosResetLog();
+    cfgDumpCfg(tsCfg, 0, false);
+  } else if (0 == strcasecmp(cmd, COMMAND_SCHEDULE_POLICY)) {
+    code = schedulerUpdatePolicy(atoi(value));
+  } else if (0 == strcasecmp(cmd, COMMAND_ENABLE_RESCHEDULE)) {
+    code = schedulerEnableReSchedule(atoi(value));
+  } else {
+    goto _return;
+  }
+
+  *processed = true;
+
+_return:
+
+  if (code) {
+    terrno = code;
+  }
+  
+  return code;  
+}
+
 static int32_t execAlterLocal(SAlterLocalStmt* pStmt) {
+  bool processed = false;
+  
+  if (execAlterCmd(pStmt->config, pStmt->value, &processed)) {
+    return terrno;
+  }
+
+  if (processed) {
+    goto _return;
+  }
+  
  if (cfgSetItem(tsCfg, pStmt->config, pStmt->value, CFG_STYPE_ALTER_CMD)) {
    return terrno;
  }
@@ -488,6 +525,8 @@ static int32_t execAlterLocal(SAlterLocalStmt* pStmt) {
    return terrno;
  }

+_return:
+
  return TSDB_CODE_SUCCESS;
 }


--- a/source/libs/executor/inc/executorimpl.h
+++ b/source/libs/executor/inc/executorimpl.h
@@ -139,6 +139,12 @@ typedef struct STaskIdInfo {
  char*    str;
 } STaskIdInfo;

+enum {
+  STREAM_RECOVER_STEP__NONE = 0,
+  STREAM_RECOVER_STEP__PREPARE,
+  STREAM_RECOVER_STEP__SCAN,
+};
+
 typedef struct {
  //TODO remove prepareStatus
  STqOffsetVal   prepareStatus; // for tmq
@@ -147,6 +153,10 @@ typedef struct {
  SSDataBlock*   pullOverBlk;   // for streaming
  SWalFilterCond cond;
  int64_t        lastScanUid;
+  int8_t         recoverStep;
+  SQueryTableDataCond tableCond;
+  int64_t recoverStartVer;
+  int64_t recoverEndVer;
 } SStreamTaskInfo;

 typedef struct SExecTaskInfo {
@@ -316,12 +326,16 @@ typedef struct STagScanInfo {

 typedef struct SLastrowScanInfo {
  SSDataBlock    *pRes;
-  SArray         *pTableList;
  SReadHandle     readHandle;
  void           *pLastrowReader;
  SArray         *pColMatchInfo;
  int32_t        *pSlotIds;
  SExprSupp       pseudoExprSup;
+  int32_t         retrieveType;
+  int32_t         currentGroupIndex;
+  SSDataBlock    *pBufferredRes;
+  SArray         *pUidList;
+  int32_t         indexOfBufferedRes;
 } SLastrowScanInfo;

 typedef enum EStreamScanMode {
@@ -825,8 +839,7 @@ SOperatorInfo* createProjectOperatorInfo(SOperatorInfo* downstream, SProjectPhys
 SOperatorInfo* createSortOperatorInfo(SOperatorInfo* downstream, SSortPhysiNode* pSortPhyNode, SExecTaskInfo* pTaskInfo);
 SOperatorInfo* createMultiwayMergeOperatorInfo(SOperatorInfo** dowStreams, size_t numStreams, SMergePhysiNode* pMergePhysiNode, SExecTaskInfo* pTaskInfo);
 SOperatorInfo* createSortedMergeOperatorInfo(SOperatorInfo** downstream, int32_t numOfDownstream, SExprInfo* pExprInfo, int32_t num, SArray* pSortInfo, SArray* pGroupInfo, SExecTaskInfo* pTaskInfo);
-SOperatorInfo* createLastrowScanOperator(SLastRowScanPhysiNode* pTableScanNode, SReadHandle* readHandle,
-                                         SArray* pTableList, SExecTaskInfo* pTaskInfo);
+SOperatorInfo* createLastrowScanOperator(SLastRowScanPhysiNode* pTableScanNode, SReadHandle* readHandle, SExecTaskInfo* pTaskInfo);

 SOperatorInfo* createIntervalOperatorInfo(SOperatorInfo* downstream, SExprInfo* pExprInfo, int32_t numOfCols,
                                          SSDataBlock* pResBlock, SInterval* pInterval, int32_t primaryTsSlotId,
@@ -944,8 +957,9 @@ int32_t finalizeResultRowIntoResultDataBlock(SDiskbasedBuf* pBuf, SResultRowPosi
                                       SqlFunctionCtx* pCtx, SExprInfo* pExprInfo, int32_t numOfExprs, const int32_t* rowCellOffset,
                                       SSDataBlock* pBlock, SExecTaskInfo* pTaskInfo);

-int32_t createScanTableListInfo(STableScanPhysiNode* pTableScanNode, SReadHandle* pHandle,
+int32_t createScanTableListInfo(SScanPhysiNode *pScanNode, SNodeList* pGroupTags, bool groupSort, SReadHandle* pHandle,
                                STableListInfo* pTableListInfo, uint64_t queryId, uint64_t taskId);
+
 SOperatorInfo* createGroupSortOperatorInfo(SOperatorInfo* downstream, SGroupSortPhysiNode* pSortPhyNode,
                                           SExecTaskInfo* pTaskInfo);
 SOperatorInfo* createTableMergeScanOperatorInfo(STableScanPhysiNode* pTableScanNode, STableListInfo *pTableListInfo,

--- a/source/libs/executor/src/cachescanoperator.c
+++ b/source/libs/executor/src/cachescanoperator.c
@@ -30,15 +30,13 @@ static SSDataBlock* doScanLastrow(SOperatorInfo* pOperator);
 static void destroyLastrowScanOperator(void* param, int32_t numOfOutput);
 static int32_t extractTargetSlotId(const SArray* pColMatchInfo, SExecTaskInfo* pTaskInfo, int32_t** pSlotIds);

-SOperatorInfo* createLastrowScanOperator(SLastRowScanPhysiNode* pScanNode, SReadHandle* readHandle, SArray* pTableList,
-                                         SExecTaskInfo* pTaskInfo) {
+SOperatorInfo* createLastrowScanOperator(SLastRowScanPhysiNode* pScanNode, SReadHandle* readHandle, SExecTaskInfo* pTaskInfo) {
  SLastrowScanInfo* pInfo = taosMemoryCalloc(1, sizeof(SLastrowScanInfo));
  SOperatorInfo*    pOperator = taosMemoryCalloc(1, sizeof(SOperatorInfo));
  if (pInfo == NULL || pOperator == NULL) {
    goto _error;
  }

-  pInfo->pTableList = pTableList;
  pInfo->readHandle = *readHandle;
  pInfo->pRes = createResDataBlock(pScanNode->scan.node.pOutputDataBlockDesc);

@@ -50,8 +48,22 @@ SOperatorInfo* createLastrowScanOperator(SLastRowScanPhysiNode* pScanNode, SRead
    goto _error;
  }

-  tsdbLastRowReaderOpen(readHandle->vnode, LASTROW_RETRIEVE_TYPE_SINGLE, pTableList, taosArrayGetSize(pInfo->pColMatchInfo),
-                        &pInfo->pLastrowReader);
+  STableListInfo* pTableList = &pTaskInfo->tableqinfoList;
+
+  initResultSizeInfo(pOperator, 1024);
+  blockDataEnsureCapacity(pInfo->pRes, pOperator->resultInfo.capacity);
+  pInfo->pUidList = taosArrayInit(4, sizeof(int64_t));
+
+  // partition by tbname
+  if (taosArrayGetSize(pTableList->pGroupList) == taosArrayGetSize(pTableList->pTableList)) {
+    pInfo->retrieveType = LASTROW_RETRIEVE_TYPE_ALL;
+    tsdbLastRowReaderOpen(pInfo->readHandle.vnode, pInfo->retrieveType, pTableList->pTableList,
+                          taosArrayGetSize(pInfo->pColMatchInfo), &pInfo->pLastrowReader);
+    pInfo->pBufferredRes = createOneDataBlock(pInfo->pRes, false);
+    blockDataEnsureCapacity(pInfo->pBufferredRes, pOperator->resultInfo.capacity);
+  } else { // by tags
+    pInfo->retrieveType = LASTROW_RETRIEVE_TYPE_SINGLE;
+  }

  if (pScanNode->scan.pScanPseudoCols != NULL) {
    SExprSupp* pPseudoExpr = &pInfo->pseudoExprSup;
@@ -60,19 +72,17 @@ SOperatorInfo* createLastrowScanOperator(SLastRowScanPhysiNode* pScanNode, SRead
    pPseudoExpr->pCtx = createSqlFunctionCtx(pPseudoExpr->pExprInfo, pPseudoExpr->numOfExprs, &pPseudoExpr->rowEntryInfoOffset);
  }

-  pOperator->name = "LastrowScanOperator";
+  pOperator->name         = "LastrowScanOperator";
  pOperator->operatorType = QUERY_NODE_PHYSICAL_PLAN_LAST_ROW_SCAN;
-  pOperator->blocking = false;
-  pOperator->status = OP_NOT_OPENED;
-  pOperator->info = pInfo;
-  pOperator->pTaskInfo = pTaskInfo;
+  pOperator->blocking     = false;
+  pOperator->status       = OP_NOT_OPENED;
+  pOperator->info         = pInfo;
+  pOperator->pTaskInfo    = pTaskInfo;
  pOperator->exprSupp.numOfExprs = taosArrayGetSize(pInfo->pRes->pDataBlock);

-  initResultSizeInfo(pOperator, 1024);
-  blockDataEnsureCapacity(pInfo->pRes, pOperator->resultInfo.capacity);
-
  pOperator->fpSet =
      createOperatorFpSet(operatorDummyOpenFn, doScanLastrow, NULL, NULL, destroyLastrowScanOperator, NULL, NULL, NULL);
+
  pOperator->cost.openCost = 0;
  return pOperator;

@@ -90,43 +100,120 @@ SSDataBlock* doScanLastrow(SOperatorInfo* pOperator) {

  SLastrowScanInfo* pInfo = pOperator->info;
  SExecTaskInfo*    pTaskInfo = pOperator->pTaskInfo;
-
-  int32_t size = taosArrayGetSize(pInfo->pTableList);
+  STableListInfo*   pTableList = &pTaskInfo->tableqinfoList;
+  int32_t           size = taosArrayGetSize(pTableList->pTableList);
  if (size == 0) {
-    setTaskStatus(pTaskInfo, TASK_COMPLETED);
+    doSetOperatorCompleted(pOperator);
    return NULL;
  }

+  blockDataCleanup(pInfo->pRes);
+
  // check if it is a group by tbname
-  if (size == taosArrayGetSize(pInfo->pTableList)) {
-    blockDataCleanup(pInfo->pRes);
-    SArray* pUidList = taosArrayInit(1, sizeof(tb_uid_t));
-    int32_t code = tsdbRetrieveLastRow(pInfo->pLastrowReader, pInfo->pRes, pInfo->pSlotIds, pUidList);
-    if (code != TSDB_CODE_SUCCESS)  {
-      longjmp(pTaskInfo->env, code);
+  if (pInfo->retrieveType == LASTROW_RETRIEVE_TYPE_ALL) {
+    if (pInfo->indexOfBufferedRes >= pInfo->pBufferredRes->info.rows) {
+      blockDataCleanup(pInfo->pBufferredRes);
+      taosArrayClear(pInfo->pUidList);
+
+      int32_t code = tsdbRetrieveLastRow(pInfo->pLastrowReader, pInfo->pBufferredRes, pInfo->pSlotIds, pInfo->pUidList);
+      if (code != TSDB_CODE_SUCCESS) {
+        longjmp(pTaskInfo->env, code);
+      }
+
+      // every buffered row must have a matching table uid
+      int32_t resultRows = pInfo->pBufferredRes->info.rows;
+      ASSERT(resultRows == taosArrayGetSize(pInfo->pUidList));
+      pInfo->indexOfBufferedRes = 0;
    }

-    // check for tag values
-    if (pInfo->pRes->info.rows > 0 && pInfo->pseudoExprSup.numOfExprs > 0) {
-      SExprSupp* pSup = &pInfo->pseudoExprSup;
-      pInfo->pRes->info.uid = *(tb_uid_t*) taosArrayGet(pUidList, 0);
-      addTagPseudoColumnData(&pInfo->readHandle, pSup->pExprInfo, pSup->numOfExprs, pInfo->pRes, GET_TASKID(pTaskInfo));
+    if (pInfo->indexOfBufferedRes < pInfo->pBufferredRes->info.rows) {
+      for (int32_t i = 0; i < taosArrayGetSize(pInfo->pColMatchInfo); ++i) {
+        SColMatchInfo* pMatchInfo = taosArrayGet(pInfo->pColMatchInfo, i);
+        int32_t slotId = pMatchInfo->targetSlotId;
+
+        SColumnInfoData* pSrc = taosArrayGet(pInfo->pBufferredRes->pDataBlock, slotId);
+        SColumnInfoData* pDst = taosArrayGet(pInfo->pRes->pDataBlock, slotId);
+
+        char* p = colDataGetData(pSrc, pInfo->indexOfBufferedRes);
+        bool isNull = colDataIsNull_s(pSrc, pInfo->indexOfBufferedRes);
+        colDataAppend(pDst, 0, p, isNull);
+      }
+
+      pInfo->pRes->info.uid = *(tb_uid_t*)taosArrayGet(pInfo->pUidList, pInfo->indexOfBufferedRes);
+      pInfo->pRes->info.rows = 1;
+
+      if (pInfo->pseudoExprSup.numOfExprs > 0) {
+        SExprSupp* pSup = &pInfo->pseudoExprSup;
+        int32_t code = addTagPseudoColumnData(&pInfo->readHandle, pSup->pExprInfo, pSup->numOfExprs, pInfo->pRes,
+                               GET_TASKID(pTaskInfo));
+        if (code != TSDB_CODE_SUCCESS) {
+          pTaskInfo->code = code;
+          return NULL;
+        }
+      }
+
+      if (pTableList->map != NULL) {
+        int64_t* groupId = taosHashGet(pTableList->map, &pInfo->pRes->info.uid, sizeof(int64_t));
+        pInfo->pRes->info.groupId = *groupId;
+      } else {
+        ASSERT(taosArrayGetSize(pTableList->pTableList) == 1);
+        STableKeyInfo* pKeyInfo = taosArrayGet(pTableList->pTableList, 0);
+        pInfo->pRes->info.groupId = pKeyInfo->groupId;
+      }
+
+      pInfo->indexOfBufferedRes += 1;
+      return pInfo->pRes;
+    } else {
+      doSetOperatorCompleted(pOperator);
+      return NULL;
+    }
+  } else {
+    size_t totalGroups = taosArrayGetSize(pTableList->pGroupList);
+
+    while (pInfo->currentGroupIndex < totalGroups) {
+      SArray* pGroupTableList = taosArrayGetP(pTableList->pGroupList, pInfo->currentGroupIndex);
+
+      tsdbLastRowReaderOpen(pInfo->readHandle.vnode, pInfo->retrieveType, pGroupTableList,
+                            taosArrayGetSize(pInfo->pColMatchInfo), &pInfo->pLastrowReader);
+      taosArrayClear(pInfo->pUidList);
+
+      int32_t code = tsdbRetrieveLastRow(pInfo->pLastrowReader, pInfo->pRes, pInfo->pSlotIds, pInfo->pUidList);
+      if (code != TSDB_CODE_SUCCESS) {
+        longjmp(pTaskInfo->env, code);
+      }
+
+      pInfo->currentGroupIndex += 1;
+
+      // check for tag values
+      if (pInfo->pRes->info.rows > 0) {
+        if (pInfo->pseudoExprSup.numOfExprs > 0) {
+          SExprSupp* pSup = &pInfo->pseudoExprSup;
+          pInfo->pRes->info.uid = *(tb_uid_t*)taosArrayGet(pInfo->pUidList, 0);
+
+          STableKeyInfo* pKeyInfo = taosArrayGet(pGroupTableList, 0);
+          pInfo->pRes->info.groupId = pKeyInfo->groupId;
+
+          code = addTagPseudoColumnData(&pInfo->readHandle, pSup->pExprInfo, pSup->numOfExprs, pInfo->pRes,
+                                 GET_TASKID(pTaskInfo));
+          if (code != TSDB_CODE_SUCCESS) {
+            pTaskInfo->code = code;
+            return NULL;
+          }
+        }
+
+        tsdbLastrowReaderClose(pInfo->pLastrowReader);
+        return pInfo->pRes;
+      }
    }

    doSetOperatorCompleted(pOperator);
-    return (pInfo->pRes->info.rows == 0) ? NULL : pInfo->pRes;
-  } else {
-    // todo fetch the result for each group
+    return NULL;
  }
-
-  return pInfo->pRes->info.rows == 0 ? NULL : pInfo->pRes;
 }

 void destroyLastrowScanOperator(void* param, int32_t numOfOutput) {
  SLastrowScanInfo* pInfo = (SLastrowScanInfo*)param;
  blockDataDestroy(pInfo->pRes);
-  tsdbLastrowReaderClose(pInfo->pLastrowReader);
-
  taosMemoryFreeClear(param);
 }


--- a/source/libs/executor/src/executil.c
+++ b/source/libs/executor/src/executil.c
@@ -65,7 +65,7 @@ size_t getResultRowSize(SqlFunctionCtx* pCtx, int32_t numOfOutput) {
  }

  rowSize +=
-      (numOfOutput * sizeof(bool));  // expand rowSize to mark if col is null for top/bottom result(saveTupleData)
+      (numOfOutput * sizeof(bool));  // expand rowSize to mark if col is null for top/bottom result(doSaveTupleData)
  return rowSize;
 }


--- a/source/libs/executor/src/executorMain.c
+++ b/source/libs/executor/src/executorMain.c
@@ -261,6 +261,15 @@ int32_t qStreamInput(qTaskInfo_t tinfo, void* pItem) {
 }
 #endif

+int32_t qStreamPrepareRecover(qTaskInfo_t tinfo, int64_t startVer, int64_t endVer) {
+  SExecTaskInfo* pTaskInfo = (SExecTaskInfo*)tinfo;
+  ASSERT(pTaskInfo->execModel == OPTR_EXEC_MODEL_STREAM);
+  pTaskInfo->streamInfo.recoverStartVer = startVer;
+  pTaskInfo->streamInfo.recoverEndVer = endVer;
+  pTaskInfo->streamInfo.recoverStep = STREAM_RECOVER_STEP__PREPARE;
+  return 0;
+}
+
 void* qExtractReaderFromStreamScanner(void* scanner) {
  SStreamScanInfo* pInfo = scanner;
  return (void*)pInfo->tqReader;

--- a/source/libs/executor/src/executorimpl.c
+++ b/source/libs/executor/src/executorimpl.c
@@ -514,8 +514,10 @@ static int32_t doSetInputDataBlock(SOperatorInfo* pOperator, SqlFunctionCtx* pCt
        pInput->startRowIndex = 0;

        // NOTE: the last parameter is the primary timestamp column
+        // todo: refactor this
        if (fmIsTimelineFunc(pCtx[i].functionId) && (j == pOneExpr->base.numOfParams - 1)) {
-          pInput->pPTS = pInput->pData[j];
+          pInput->pPTS = pInput->pData[j];   // in case of merge function, this is not always the ts column data.
+//          ASSERT(pInput->pPTS->info.type == TSDB_DATA_TYPE_TIMESTAMP);
        }
        ASSERT(pInput->pData[j] != NULL);
      } else if (pFuncParam->type == FUNC_PARAM_TYPE_VALUE) {
@@ -4291,6 +4293,7 @@ int32_t generateGroupIdMap(STableListInfo* pTableListInfo, SReadHandle* pHandle,
        }
      }
    }
+
    int32_t  len = (int32_t)(pStart - (char*)keyBuf);
    uint64_t groupId = calcGroupId(keyBuf, len);
    taosHashPut(pTableListInfo->map, &(info->uid), sizeof(uint64_t), &groupId, sizeof(uint64_t));
@@ -4309,6 +4312,30 @@ int32_t generateGroupIdMap(STableListInfo* pTableListInfo, SReadHandle* pHandle,
  return TDB_CODE_SUCCESS;
 }

+static int32_t initTableblockDistQueryCond(uint64_t uid, SQueryTableDataCond* pCond) {
+  memset(pCond, 0, sizeof(SQueryTableDataCond));
+
+  pCond->order = TSDB_ORDER_ASC;
+  pCond->numOfCols = 1;
+  pCond->colList = taosMemoryCalloc(1, sizeof(SColumnInfo));
+  if (pCond->colList == NULL) {
+    terrno = TSDB_CODE_QRY_OUT_OF_MEMORY;
+    return terrno;
+  }
+
+  pCond->colList->colId = 1;
+  pCond->colList->type = TSDB_DATA_TYPE_TIMESTAMP;
+  pCond->colList->bytes = sizeof(TSKEY);
+
+  pCond->twindows = (STimeWindow){.skey = INT64_MIN, .ekey = INT64_MAX};
+  pCond->suid = uid;
+  pCond->type = BLOCK_LOAD_OFFSET_ORDER;
+  pCond->startVersion = -1;
+  pCond->endVersion = -1;
+
+  return TSDB_CODE_SUCCESS;
+}
+
 SOperatorInfo* createOperatorTree(SPhysiNode* pPhyNode, SExecTaskInfo* pTaskInfo, SReadHandle* pHandle,
                                  uint64_t queryId, uint64_t taskId, STableListInfo* pTableListInfo,
                                  const char* pUser) {
@@ -4318,7 +4345,8 @@ SOperatorInfo* createOperatorTree(SPhysiNode* pPhyNode, SExecTaskInfo* pTaskInfo
    if (QUERY_NODE_PHYSICAL_PLAN_TABLE_SCAN == type) {
      STableScanPhysiNode* pTableScanNode = (STableScanPhysiNode*)pPhyNode;

-      int32_t code = createScanTableListInfo(pTableScanNode, pHandle, pTableListInfo, queryId, taskId);
+      int32_t code = createScanTableListInfo(&pTableScanNode->scan, pTableScanNode->pGroupTags,
+                                             pTableScanNode->groupSort, pHandle, pTableListInfo, queryId, taskId);
      if (code) {
        pTaskInfo->code = code;
        return NULL;
@@ -4337,7 +4365,8 @@ SOperatorInfo* createOperatorTree(SPhysiNode* pPhyNode, SExecTaskInfo* pTaskInfo

    } else if (QUERY_NODE_PHYSICAL_PLAN_TABLE_MERGE_SCAN == type) {
      STableMergeScanPhysiNode* pTableScanNode = (STableMergeScanPhysiNode*)pPhyNode;
-      int32_t code = createScanTableListInfo(pTableScanNode, pHandle, pTableListInfo, queryId, taskId);
+      int32_t code = createScanTableListInfo(&pTableScanNode->scan, pTableScanNode->pGroupTags,
+                                             pTableScanNode->groupSort, pHandle, pTableListInfo, queryId, taskId);
      if (code) {
        pTaskInfo->code = code;
        return NULL;
@@ -4366,7 +4395,8 @@ SOperatorInfo* createOperatorTree(SPhysiNode* pPhyNode, SExecTaskInfo* pTaskInfo
            .maxTs = INT64_MIN,
      };
      if (pHandle) {
-        int32_t code = createScanTableListInfo(pTableScanNode, pHandle, pTableListInfo, queryId, taskId);
+        int32_t code = createScanTableListInfo(&pTableScanNode->scan, pTableScanNode->pGroupTags,
+                                               pTableScanNode->groupSort, pHandle, pTableListInfo, queryId, taskId);
        if (code) {
          pTaskInfo->code = code;
          return NULL;
@@ -4406,25 +4436,9 @@ SOperatorInfo* createOperatorTree(SPhysiNode* pPhyNode, SExecTaskInfo* pTaskInfo
      }

      SQueryTableDataCond cond = {0};
-
-      {
-        cond.order = TSDB_ORDER_ASC;
-        cond.numOfCols = 1;
-        cond.colList = taosMemoryCalloc(1, sizeof(SColumnInfo));
-        if (cond.colList == NULL) {
-          terrno = TSDB_CODE_QRY_OUT_OF_MEMORY;
-          return NULL;
-        }
-
-        cond.colList->colId = 1;
-        cond.colList->type = TSDB_DATA_TYPE_TIMESTAMP;
-        cond.colList->bytes = sizeof(TSKEY);
-
-        cond.twindows = (STimeWindow){.skey = INT64_MIN, .ekey = INT64_MAX};
-        cond.suid = pBlockNode->suid;
-        cond.type = BLOCK_LOAD_OFFSET_ORDER;
-        cond.startVersion = -1;
-        cond.endVersion  =  -1;
+      int32_t code = initTableblockDistQueryCond(pBlockNode->suid, &cond);
+      if (code != TSDB_CODE_SUCCESS) {
+        return NULL;
      }

      STsdbReader* pReader = NULL;
@@ -4435,31 +4449,20 @@ SOperatorInfo* createOperatorTree(SPhysiNode* pPhyNode, SExecTaskInfo* pTaskInfo
    } else if (QUERY_NODE_PHYSICAL_PLAN_LAST_ROW_SCAN == type) {
      SLastRowScanPhysiNode* pScanNode = (SLastRowScanPhysiNode*)pPhyNode;

-      //      int32_t code = createScanTableListInfo(pTableScanNode, pHandle, pTableListInfo, queryId, taskId);
-      //      if (code) {
-      //        pTaskInfo->code = code;
-      //        return NULL;
-      //      }
-
-      int32_t code = extractTableSchemaInfo(pHandle, pScanNode->scan.uid, pTaskInfo);
+      int32_t code = createScanTableListInfo(&pScanNode->scan, pScanNode->pGroupTags, true, pHandle, pTableListInfo,
+                                             queryId, taskId);
      if (code != TSDB_CODE_SUCCESS) {
        pTaskInfo->code = code;
        return NULL;
      }

-      pTableListInfo->pTableList = taosArrayInit(4, sizeof(STableKeyInfo));
-      if (pScanNode->scan.tableType == TSDB_SUPER_TABLE) {
-        code = vnodeGetAllTableList(pHandle->vnode, pScanNode->scan.uid, pTableListInfo->pTableList);
-        if (code != TSDB_CODE_SUCCESS) {
-          pTaskInfo->code = terrno;
-          return NULL;
-        }
-      } else {  // Create one table group.
-        STableKeyInfo info = {.lastKey = 0, .uid = pScanNode->scan.uid, .groupId = 0};
-        taosArrayPush(pTableListInfo->pTableList, &info);
+      code = extractTableSchemaInfo(pHandle, pScanNode->scan.uid, pTaskInfo);
+      if (code != TSDB_CODE_SUCCESS) {
+        pTaskInfo->code = code;
+        return NULL;
      }

-      return createLastrowScanOperator(pScanNode, pHandle, pTableListInfo->pTableList, pTaskInfo);
+      return createLastrowScanOperator(pScanNode, pHandle, pTaskInfo);
    } else {
      ASSERT(0);
    }
@@ -4928,6 +4931,9 @@ static void doDestroyTableList(STableListInfo* pTableqinfoList) {
  if (pTableqinfoList->needSortTableByGroupId) {
    for (int32_t i = 0; i < taosArrayGetSize(pTableqinfoList->pGroupList); i++) {
      SArray* tmp = taosArrayGetP(pTableqinfoList->pGroupList, i);
+      if (tmp == pTableqinfoList->pTableList) {
+        continue;
+      }
      taosArrayDestroy(tmp);
    }
  }

--- a/source/libs/executor/src/scanoperator.c
+++ b/source/libs/executor/src/scanoperator.c
@@ -516,10 +516,14 @@ static SSDataBlock* doTableScan(SOperatorInfo* pOperator) {
    }

    SArray* tableList = taosArrayGetP(pTaskInfo->tableqinfoList.pGroupList, pInfo->currentGroupId);
+
    tsdbReaderClose(pInfo->dataReader);

    int32_t code = tsdbReaderOpen(pInfo->readHandle.vnode, &pInfo->cond, tableList, (STsdbReader**)&pInfo->dataReader,
                                  GET_TASKID(pTaskInfo));
+    if (code != 0) {
+      // TODO: handle tsdbReaderOpen failure
+    }
  }

  SSDataBlock* result = doTableScanGroup(pOperator);
@@ -871,6 +875,7 @@ static bool prepareRangeScan(SStreamScanInfo* pInfo, SSDataBlock* pBlock, int32_
  }

  resetTableScanInfo(pInfo->pTableScanOp->info, &win);
+  pInfo->pTableScanOp->status = OP_OPENED;
  return true;
 }

@@ -1193,8 +1198,6 @@ static int32_t setBlockIntoRes(SStreamScanInfo* pInfo, const SSDataBlock* pBlock
    }
  }

-  ASSERT(pInfo->pRes->pDataBlock != NULL);
-
  // currently only the tbname pseudo column
  if (pInfo->numOfPseudoExpr > 0) {
    int32_t code = addTagPseudoColumnData(&pInfo->readHandle, pInfo->pPseudoExpr, pInfo->numOfPseudoExpr, pInfo->pRes,
@@ -1259,6 +1262,24 @@ static SSDataBlock* doStreamScan(SOperatorInfo* pOperator) {
    return NULL;
  }

+  if (pTaskInfo->streamInfo.recoverStep == STREAM_RECOVER_STEP__PREPARE) {
+    STableScanInfo* pTSInfo = pInfo->pTableScanOp->info;
+    memcpy(&pTSInfo->cond, &pTaskInfo->streamInfo.tableCond, sizeof(SQueryTableDataCond));
+    pTSInfo->scanTimes = 0;
+    pTSInfo->currentGroupId = -1;
+    pTaskInfo->streamInfo.recoverStep = STREAM_RECOVER_STEP__SCAN;
+  }
+
+  if (pTaskInfo->streamInfo.recoverStep == STREAM_RECOVER_STEP__SCAN) {
+    SSDataBlock* pBlock = doTableScan(pInfo->pTableScanOp);
+    if (pBlock != NULL) {
+      return pBlock;
+    }
+    // TODO fill in bloom filter
+    pTaskInfo->streamInfo.recoverStep = STREAM_RECOVER_STEP__NONE;
+    return NULL;
+  }
+
  size_t total = taosArrayGetSize(pInfo->pBlockLists);
  // TODO: refactor
  if (pInfo->blockType == STREAM_INPUT__DATA_BLOCK) {
@@ -1551,6 +1572,7 @@ SOperatorInfo* createStreamScanOperatorInfo(SReadHandle* pHandle, STableScanPhys
      goto _error;
    }
    taosArrayDestroy(tableIdList);
+    memcpy(&pTaskInfo->streamInfo.tableCond, &pTSInfo->cond, sizeof(SQueryTableDataCond));
  }

  // create the pseduo columns info
@@ -2402,9 +2424,9 @@ typedef struct STableMergeScanInfo {
  SSampleExecInfo sample;  // sample execution info
 } STableMergeScanInfo;

-int32_t createScanTableListInfo(STableScanPhysiNode* pTableScanNode, SReadHandle* pHandle,
+int32_t createScanTableListInfo(SScanPhysiNode* pScanNode, SNodeList* pGroupTags, bool groupSort, SReadHandle* pHandle,
                                STableListInfo* pTableListInfo, uint64_t queryId, uint64_t taskId) {
-  int32_t code = getTableList(pHandle->meta, pHandle->vnode, &pTableScanNode->scan, pTableListInfo);
+  int32_t code = getTableList(pHandle->meta, pHandle->vnode, pScanNode, pTableListInfo);
  if (code != TSDB_CODE_SUCCESS) {
    return code;
  }
@@ -2414,8 +2436,8 @@ int32_t createScanTableListInfo(STableScanPhysiNode* pTableScanNode, SReadHandle
    return TSDB_CODE_SUCCESS;
  }

-  pTableListInfo->needSortTableByGroupId = pTableScanNode->groupSort;
-  code = generateGroupIdMap(pTableListInfo, pHandle, pTableScanNode->pGroupTags);
+  pTableListInfo->needSortTableByGroupId = groupSort;
+  code = generateGroupIdMap(pTableListInfo, pHandle, pGroupTags);
  if (code != TSDB_CODE_SUCCESS) {
    return code;
  }

--- a/source/libs/function/inc/builtinsimpl.h
+++ b/source/libs/function/inc/builtinsimpl.h
@@ -106,7 +106,7 @@ bool irateFuncSetup(SqlFunctionCtx *pCtx, SResultRowEntryInfo* pResInfo);
 int32_t irateFunction(SqlFunctionCtx *pCtx);
 int32_t irateFinalize(SqlFunctionCtx* pCtx, SSDataBlock* pBlock);

-int32_t cacheLastRowFunction(SqlFunctionCtx* pCtx);
+int32_t cachedLastRowFunction(SqlFunctionCtx* pCtx);

 bool getFirstLastFuncEnv(struct SFunctionNode* pFunc, SFuncExecEnv* pEnv);
 int32_t firstFunction(SqlFunctionCtx *pCtx);

--- a/source/libs/function/src/builtins.c
+++ b/source/libs/function/src/builtins.c
@@ -1981,6 +1981,7 @@ const SBuiltinFuncDefinition funcMgtBuiltins[] = {
    .getEnvFunc   = getLeastSQRFuncEnv,
    .initFunc     = leastSQRFunctionSetup,
    .processFunc  = leastSQRFunction,
+    .sprocessFunc = leastSQRScalarFunction,
    .finalizeFunc = leastSQRFinalize,
    .invertFunc   = NULL,
    .combineFunc  = leastSQRCombine,
@@ -2228,11 +2229,11 @@ const SBuiltinFuncDefinition funcMgtBuiltins[] = {
  {
    .name = "_cache_last_row",
    .type = FUNCTION_TYPE_CACHE_LAST_ROW,
-    .classification = FUNC_MGT_AGG_FUNC | FUNC_MGT_MULTI_RES_FUNC | FUNC_MGT_TIMELINE_FUNC | FUNC_MGT_IMPLICIT_TS_FUNC,
+    .classification = FUNC_MGT_AGG_FUNC | FUNC_MGT_MULTI_RES_FUNC | FUNC_MGT_SELECT_FUNC | FUNC_MGT_TIMELINE_FUNC | FUNC_MGT_IMPLICIT_TS_FUNC,
    .translateFunc = translateFirstLast,
    .getEnvFunc   = getFirstLastFuncEnv,
    .initFunc     = functionSetup,
-    .processFunc  = cacheLastRowFunction,
+    .processFunc  = cachedLastRowFunction,
    .finalizeFunc = firstLastFinalize
  },
  {

--- a/source/libs/function/src/builtinsimpl.c
+++ b/source/libs/function/src/builtinsimpl.c
--- a/source/libs/function/src/udfd.c
+++ b/source/libs/function/src/udfd.c
@@ -382,6 +382,15 @@ void udfdProcessRpcRsp(void *parent, SRpcMsg *pMsg, SEpSet *pEpSet) {
  if (msgInfo->rpcType == UDFD_RPC_MNODE_CONNECT) {
    SConnectRsp connectRsp = {0};
    tDeserializeSConnectRsp(pMsg->pCont, pMsg->contLen, &connectRsp);
+
+    int32_t now = taosGetTimestampSec();
+    int32_t delta = abs(now - connectRsp.svrTimestamp);
+    if (delta > 900) {
+      msgInfo->code = TSDB_CODE_TIME_UNSYNCED;
+      goto _return;
+    }
+
+
    if (connectRsp.epSet.numOfEps == 0) {
      msgInfo->code = TSDB_CODE_MND_APP_ERROR;
      goto _return;

--- a/source/libs/index/src/indexCache.c
+++ b/source/libs/index/src/indexCache.c
@@ -516,13 +516,14 @@ static void idxCacheMakeRoomForWrite(IndexCache* cache) {
      idxCacheRef(cache);
      cache->imm = cache->mem;
      cache->mem = idxInternalCacheCreate(cache->type);
+
      cache->mem->pCache = cache;
      cache->occupiedMem = 0;
      if (quit == false) {
        atomic_store_32(&cache->merging, 1);
      }
-      // sched to merge
-      // unref cache in bgwork
+      // 1. sched to merge
+      // 2. unref cache in bgwork
      idxCacheSchedToMerge(cache, quit);
    }
  }

--- a/source/libs/nodes/src/nodesUtilFuncs.c
+++ b/source/libs/nodes/src/nodesUtilFuncs.c
@@ -388,6 +388,11 @@ static void destroyDataSinkNode(SDataSinkNode* pNode) { nodesDestroyNode((SNode*

 static void destroyExprNode(SExprNode* pExpr) { taosArrayDestroy(pExpr->pAssociation); }

+static void nodesDestroyNodePointer(void* node) {
+  SNode* pNode = *(SNode**)node;
+  nodesDestroyNode(pNode);
+}
+
 void nodesDestroyNode(SNode* pNode) {
  if (NULL == pNode) {
    return;
@@ -718,6 +723,7 @@ void nodesDestroyNode(SNode* pNode) {
      }
      taosArrayDestroy(pQuery->pDbList);
      taosArrayDestroy(pQuery->pTableList);
+      taosArrayDestroyEx(pQuery->pPlaceholderValues, nodesDestroyNodePointer);
      break;
    }
    case QUERY_NODE_LOGIC_PLAN_SCAN: {

--- a/source/libs/parser/src/parAstParser.c
+++ b/source/libs/parser/src/parAstParser.c
@@ -118,36 +118,33 @@ static bool needGetTableIndex(SNode* pStmt) {
  return false;
 }

-static int32_t collectMetaKeyFromRealTableImpl(SCollectMetaKeyCxt* pCxt, SRealTableNode* pRealTable,
+static int32_t collectMetaKeyFromRealTableImpl(SCollectMetaKeyCxt* pCxt, const char* pDb, const char* pTable,
                                               AUTH_TYPE authType) {
-  int32_t code = reserveTableMetaInCache(pCxt->pParseCxt->acctId, pRealTable->table.dbName, pRealTable->table.tableName,
-                                         pCxt->pMetaCache);
+  int32_t code = reserveTableMetaInCache(pCxt->pParseCxt->acctId, pDb, pTable, pCxt->pMetaCache);
  if (TSDB_CODE_SUCCESS == code) {
-    code = reserveTableVgroupInCache(pCxt->pParseCxt->acctId, pRealTable->table.dbName, pRealTable->table.tableName,
-                                     pCxt->pMetaCache);
+    code = reserveTableVgroupInCache(pCxt->pParseCxt->acctId, pDb, pTable, pCxt->pMetaCache);
  }
  if (TSDB_CODE_SUCCESS == code) {
-    code = reserveUserAuthInCache(pCxt->pParseCxt->acctId, pCxt->pParseCxt->pUser, pRealTable->table.dbName, authType,
-                                  pCxt->pMetaCache);
+    code = reserveUserAuthInCache(pCxt->pParseCxt->acctId, pCxt->pParseCxt->pUser, pDb, authType, pCxt->pMetaCache);
  }
  if (TSDB_CODE_SUCCESS == code) {
-    code = reserveDbVgInfoInCache(pCxt->pParseCxt->acctId, pRealTable->table.dbName, pCxt->pMetaCache);
+    code = reserveDbVgInfoInCache(pCxt->pParseCxt->acctId, pDb, pCxt->pMetaCache);
  }
  if (TSDB_CODE_SUCCESS == code && needGetTableIndex(pCxt->pStmt)) {
-    code = reserveTableIndexInCache(pCxt->pParseCxt->acctId, pRealTable->table.dbName, pRealTable->table.tableName,
-                                    pCxt->pMetaCache);
+    code = reserveTableIndexInCache(pCxt->pParseCxt->acctId, pDb, pTable, pCxt->pMetaCache);
  }
-  if (TSDB_CODE_SUCCESS == code && (0 == strcmp(pRealTable->table.tableName, TSDB_INS_TABLE_DNODE_VARIABLES))) {
+  if (TSDB_CODE_SUCCESS == code && (0 == strcmp(pTable, TSDB_INS_TABLE_DNODE_VARIABLES))) {
    code = reserveDnodeRequiredInCache(pCxt->pMetaCache);
  }
  if (TSDB_CODE_SUCCESS == code) {
-    code = reserveDbCfgInCache(pCxt->pParseCxt->acctId, pRealTable->table.dbName, pCxt->pMetaCache);
+    code = reserveDbCfgInCache(pCxt->pParseCxt->acctId, pDb, pCxt->pMetaCache);
  }
  return code;
 }

 static EDealRes collectMetaKeyFromRealTable(SCollectMetaKeyFromExprCxt* pCxt, SRealTableNode* pRealTable) {
-  pCxt->errCode = collectMetaKeyFromRealTableImpl(pCxt->pComCxt, pRealTable, AUTH_TYPE_READ);
+  pCxt->errCode = collectMetaKeyFromRealTableImpl(pCxt->pComCxt, pRealTable->table.dbName, pRealTable->table.tableName,
+                                                  AUTH_TYPE_READ);
  return TSDB_CODE_SUCCESS == pCxt->errCode ? DEAL_RES_CONTINUE : DEAL_RES_ERROR;
 }

@@ -454,11 +451,13 @@ static int32_t collectMetaKeyFromShowTransactions(SCollectMetaKeyCxt* pCxt, SSho
 }

 static int32_t collectMetaKeyFromDelete(SCollectMetaKeyCxt* pCxt, SDeleteStmt* pStmt) {
-  return collectMetaKeyFromRealTableImpl(pCxt, (SRealTableNode*)pStmt->pFromTable, AUTH_TYPE_WRITE);
+  STableNode* pTable = (STableNode*)pStmt->pFromTable;
+  return collectMetaKeyFromRealTableImpl(pCxt, pTable->dbName, pTable->tableName, AUTH_TYPE_WRITE);
 }

 static int32_t collectMetaKeyFromInsert(SCollectMetaKeyCxt* pCxt, SInsertStmt* pStmt) {
-  int32_t code = collectMetaKeyFromRealTableImpl(pCxt, (SRealTableNode*)pStmt->pTable, AUTH_TYPE_WRITE);
+  STableNode* pTable = (STableNode*)pStmt->pTable;
+  int32_t     code = collectMetaKeyFromRealTableImpl(pCxt, pTable->dbName, pTable->tableName, AUTH_TYPE_WRITE);
  if (TSDB_CODE_SUCCESS == code) {
    code = collectMetaKeyFromQuery(pCxt, pStmt->pQuery);
  }
@@ -471,14 +470,7 @@ static int32_t collectMetaKeyFromShowBlockDist(SCollectMetaKeyCxt* pCxt, SShowTa
  strcpy(name.tname, pStmt->tableName);
  int32_t code = catalogRemoveTableMeta(pCxt->pParseCxt->pCatalog, &name);
  if (TSDB_CODE_SUCCESS == code) {
-    code = reserveTableMetaInCache(pCxt->pParseCxt->acctId, pStmt->dbName, pStmt->tableName, pCxt->pMetaCache);
-  }
-
-  if (TSDB_CODE_SUCCESS == code) {
-    code = reserveTableVgroupInCache(pCxt->pParseCxt->acctId, pStmt->dbName, pStmt->tableName, pCxt->pMetaCache);
-  }
-  if (TSDB_CODE_SUCCESS == code) {
-    code = reserveDbVgInfoInCache(pCxt->pParseCxt->acctId, pStmt->dbName, pCxt->pMetaCache);
+    code = collectMetaKeyFromRealTableImpl(pCxt, pStmt->dbName, pStmt->tableName, AUTH_TYPE_READ);
  }
  return code;
 }

--- a/source/libs/parser/test/mockCatalog.cpp
+++ b/source/libs/parser/test/mockCatalog.cpp
@@ -159,8 +159,8 @@ void generatePerformanceSchema(MockCatalogService* mcs) {
 *          c4         |       column       |       DOUBLE       |    8     |
 *          c5         |       column       |       DOUBLE       |    8     |
 */
-void generateTestTables(MockCatalogService* mcs) {
-  ITableBuilder& builder = mcs->createTableBuilder("test", "t1", TSDB_NORMAL_TABLE, 6)
+void generateTestTables(MockCatalogService* mcs, const std::string& db) {
+  ITableBuilder& builder = mcs->createTableBuilder(db, "t1", TSDB_NORMAL_TABLE, 6)
                               .setPrecision(TSDB_TIME_PRECISION_MILLI)
                               .setVgid(1)
                               .addColumn("ts", TSDB_DATA_TYPE_TIMESTAMP)
@@ -193,9 +193,9 @@ void generateTestTables(MockCatalogService* mcs) {
 *         jtag        |        tag         |        json        |    --    |
 * Child Table: st2s1, st2s2
 */
-void generateTestStables(MockCatalogService* mcs) {
+void generateTestStables(MockCatalogService* mcs, const std::string& db) {
  {
-    ITableBuilder& builder = mcs->createTableBuilder("test", "st1", TSDB_SUPER_TABLE, 3, 3)
+    ITableBuilder& builder = mcs->createTableBuilder(db, "st1", TSDB_SUPER_TABLE, 3, 3)
                                 .setPrecision(TSDB_TIME_PRECISION_MILLI)
                                 .addColumn("ts", TSDB_DATA_TYPE_TIMESTAMP)
                                 .addColumn("c1", TSDB_DATA_TYPE_INT)
@@ -204,20 +204,20 @@ void generateTestStables(MockCatalogService* mcs) {
                                 .addTag("tag2", TSDB_DATA_TYPE_BINARY, 20)
                                 .addTag("tag3", TSDB_DATA_TYPE_TIMESTAMP);
    builder.done();
-    mcs->createSubTable("test", "st1", "st1s1", 1);
-    mcs->createSubTable("test", "st1", "st1s2", 2);
-    mcs->createSubTable("test", "st1", "st1s3", 1);
+    mcs->createSubTable(db, "st1", "st1s1", 1);
+    mcs->createSubTable(db, "st1", "st1s2", 2);
+    mcs->createSubTable(db, "st1", "st1s3", 1);
  }
  {
-    ITableBuilder& builder = mcs->createTableBuilder("test", "st2", TSDB_SUPER_TABLE, 3, 1)
+    ITableBuilder& builder = mcs->createTableBuilder(db, "st2", TSDB_SUPER_TABLE, 3, 1)
                                 .setPrecision(TSDB_TIME_PRECISION_MILLI)
                                 .addColumn("ts", TSDB_DATA_TYPE_TIMESTAMP)
                                 .addColumn("c1", TSDB_DATA_TYPE_INT)
                                 .addColumn("c2", TSDB_DATA_TYPE_BINARY, 20)
                                 .addTag("jtag", TSDB_DATA_TYPE_JSON);
    builder.done();
-    mcs->createSubTable("test", "st2", "st2s1", 1);
-    mcs->createSubTable("test", "st2", "st2s2", 2);
+    mcs->createSubTable(db, "st2", "st2s1", 1);
+    mcs->createSubTable(db, "st2", "st2s2", 2);
  }
 }

@@ -237,6 +237,11 @@ void generateDatabases(MockCatalogService* mcs) {
  mcs->createDatabase(TSDB_INFORMATION_SCHEMA_DB);
  mcs->createDatabase(TSDB_PERFORMANCE_SCHEMA_DB);
  mcs->createDatabase("test");
+  generateTestTables(g_mockCatalogService.get(), "test");
+  generateTestStables(g_mockCatalogService.get(), "test");
+  mcs->createDatabase("cache_db", false, 1);
+  generateTestTables(g_mockCatalogService.get(), "cache_db");
+  generateTestStables(g_mockCatalogService.get(), "cache_db");
  mcs->createDatabase("rollup_db", true);
 }

@@ -369,11 +374,8 @@ void generateMetaData() {
  generateDatabases(g_mockCatalogService.get());
  generateInformationSchema(g_mockCatalogService.get());
  generatePerformanceSchema(g_mockCatalogService.get());
-  generateTestTables(g_mockCatalogService.get());
-  generateTestStables(g_mockCatalogService.get());
  generateFunctions(g_mockCatalogService.get());
  generateDnodes(g_mockCatalogService.get());
-  g_mockCatalogService->showTables();
 }

 void destroyMetaDataEnv() { g_mockCatalogService.reset(); }
--- a/source/libs/parser/test/mockCatalogService.cpp
+++ b/source/libs/parser/test/mockCatalogService.cpp
@@ -334,11 +334,12 @@ class MockCatalogServiceImpl {
    dnode_.insert(std::make_pair(dnodeId, epSet));
  }

-  void createDatabase(const std::string& db, bool rollup) {
+  void createDatabase(const std::string& db, bool rollup, int8_t cacheLast) {
    SDbCfgInfo cfg = {0};
    if (rollup) {
      cfg.pRetensions = taosArrayInit(TARRAY_MIN_SIZE, sizeof(SRetention));
    }
+    cfg.cacheLast = cacheLast;
    dbCfg_.insert(std::make_pair(db, cfg));
  }

@@ -627,7 +628,9 @@ void MockCatalogService::createDnode(int32_t dnodeId, const std::string& host, i
  impl_->createDnode(dnodeId, host, port);
 }

-void MockCatalogService::createDatabase(const std::string& db, bool rollup) { impl_->createDatabase(db, rollup); }
+void MockCatalogService::createDatabase(const std::string& db, bool rollup, int8_t cacheLast) {
+  impl_->createDatabase(db, rollup, cacheLast);
+}

 int32_t MockCatalogService::catalogGetTableMeta(const SName* pTableName, STableMeta** pTableMeta) const {
  return impl_->catalogGetTableMeta(pTableName, pTableMeta);

--- a/source/libs/parser/test/mockCatalogService.h
+++ b/source/libs/parser/test/mockCatalogService.h
@@ -63,7 +63,7 @@ class MockCatalogService {
  void createFunction(const std::string& func, int8_t funcType, int8_t outputType, int32_t outputLen, int32_t bufSize);
  void createSmaIndex(const SMCreateSmaReq* pReq);
  void createDnode(int32_t dnodeId, const std::string& host, int16_t port);
-  void createDatabase(const std::string& db, bool rollup = false);
+  void createDatabase(const std::string& db, bool rollup = false, int8_t cacheLast = 0);

  int32_t catalogGetTableMeta(const SName* pTableName, STableMeta** pTableMeta) const;
  int32_t catalogGetTableHashVgroup(const SName* pTableName, SVgroupInfo* vgInfo) const;

--- a/source/libs/parser/test/parShowToUse.cpp
+++ b/source/libs/parser/test/parShowToUse.cpp
@@ -179,6 +179,12 @@ TEST_F(ParserShowToUseTest, showTables) {
  run("SHOW test.tables like 'c%'");
 }

+TEST_F(ParserShowToUseTest, showTableDistributed) {
+  useDb("root", "test");
+
+  run("SHOW TABLE DISTRIBUTED st1");
+}
+
 // todo SHOW topics

 TEST_F(ParserShowToUseTest, showUsers) {

--- a/source/libs/planner/src/planOptimizer.c
+++ b/source/libs/planner/src/planOptimizer.c
@@ -1993,7 +1993,8 @@ static bool lastRowScanOptMayBeOptimized(SLogicNode* pNode) {
  SNode* pFunc = NULL;
  FOREACH(pFunc, ((SAggLogicNode*)pNode)->pAggFuncs) {
    if (FUNCTION_TYPE_LAST_ROW != ((SFunctionNode*)pFunc)->funcType &&
-        FUNCTION_TYPE_SELECT_VALUE != ((SFunctionNode*)pFunc)->funcType) {
+        FUNCTION_TYPE_SELECT_VALUE != ((SFunctionNode*)pFunc)->funcType &&
+        FUNCTION_TYPE_GROUP_KEY != ((SFunctionNode*)pFunc)->funcType) {
      return false;
    }
  }
@@ -2011,11 +2012,13 @@ static int32_t lastRowScanOptimize(SOptimizeContext* pCxt, SLogicSubplan* pLogic
  SNode* pNode = NULL;
  FOREACH(pNode, pAgg->pAggFuncs) {
    SFunctionNode* pFunc = (SFunctionNode*)pNode;
-    int32_t        len = snprintf(pFunc->functionName, sizeof(pFunc->functionName), "_cache_last_row");
-    pFunc->functionName[len] = '\0';
-    int32_t code = fmGetFuncInfo(pFunc, NULL, 0);
-    if (TSDB_CODE_SUCCESS != code) {
-      return code;
+    if (FUNCTION_TYPE_LAST_ROW == pFunc->funcType) {
+      int32_t len = snprintf(pFunc->functionName, sizeof(pFunc->functionName), "_cache_last_row");
+      pFunc->functionName[len] = '\0';
+      int32_t code = fmGetFuncInfo(pFunc, NULL, 0);
+      if (TSDB_CODE_SUCCESS != code) {
+        return code;
+      }
    }
  }
  pAgg->hasLastRow = false;
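
Reviewer note: the two hunks above can be read as one rule. An aggregate qualifies for the cached last-row scan only if every function is LAST_ROW, a select-value function, or a group-key function, and only the LAST_ROW entries are renamed. Below is a minimal sketch of that rule, assuming simplified stand-in types. `FuncType`, `Func`, `may_use_last_row_cache` and `rewrite_for_cache` are hypothetical names for illustration, not planner APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the planner's function node types. */
typedef enum { FUNC_LAST_ROW, FUNC_SELECT_VALUE, FUNC_GROUP_KEY, FUNC_SUM } FuncType;

typedef struct {
  FuncType type;
  char     name[32];
} Func;

/* True when every function is one the cached last-row scan can serve. */
static bool may_use_last_row_cache(const Func *funcs, size_t n) {
  assert(funcs != NULL || n == 0);
  for (size_t i = 0; i < n; ++i) {
    if (funcs[i].type != FUNC_LAST_ROW && funcs[i].type != FUNC_SELECT_VALUE &&
        funcs[i].type != FUNC_GROUP_KEY) {
      return false;
    }
  }
  return true;
}

/* Renames only the LAST_ROW entries; returns how many were rewritten. */
static size_t rewrite_for_cache(Func *funcs, size_t n) {
  size_t rewritten = 0;
  for (size_t i = 0; i < n; ++i) {
    if (funcs[i].type == FUNC_LAST_ROW) {
      snprintf(funcs[i].name, sizeof(funcs[i].name), "_cache_last_row");
      ++rewritten;
    }
  }
  return rewritten;
}
```

Under this reading, a list that mixes LAST_ROW with an ordinary aggregate such as SUM is rejected as a whole, while select-value and group-key entries pass the check but keep their original names.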

--- a/source/libs/planner/test/planBasicTest.cpp
+++ b/source/libs/planner/test/planBasicTest.cpp
@@ -98,6 +98,24 @@ TEST_F(PlanBasicTest, interpFunc) {
 }

 TEST_F(PlanBasicTest, lastRowFunc) {
+  useDb("root", "cache_db");
+
+  run("SELECT LAST_ROW(c1) FROM t1");
+
+  run("SELECT LAST_ROW(*) FROM t1");
+
+  run("SELECT LAST_ROW(c1, c2) FROM t1");
+
+  run("SELECT LAST_ROW(c1), c2 FROM t1");
+
+  run("SELECT LAST_ROW(c1) FROM st1");
+
+  run("SELECT LAST_ROW(c1) FROM st1 PARTITION BY TBNAME");
+
+  run("SELECT LAST_ROW(c1), SUM(c3) FROM t1");
+}
+
+TEST_F(PlanBasicTest, lastRowFuncWithoutCache) {
  useDb("root", "test");

  run("SELECT LAST_ROW(c1) FROM t1");

--- a/source/libs/qworker/inc/qwInt.h
+++ b/source/libs/qworker/inc/qwInt.h
@@ -378,6 +378,7 @@ void qwDbgDumpMgmtInfo(SQWorker *mgmt);
 int32_t qwDbgValidateStatus(QW_FPARAMS_DEF, int8_t oriStatus, int8_t newStatus, bool *ignore);
 int32_t qwDbgBuildAndSendRedirectRsp(int32_t rspType, SRpcHandleInfo *pConn, int32_t code, SEpSet *pEpSet);
 int32_t qwAddTaskCtx(QW_FPARAMS_DEF);
+int32_t qwDbgResponseRedirect(SQWMsg *qwMsg, SQWTaskCtx *ctx);


 #ifdef __cplusplus

--- a/source/libs/qworker/inc/qwMsg.h
+++ b/source/libs/qworker/inc/qwMsg.h
@@ -24,7 +24,7 @@ extern "C" {
 #include "dataSinkMgt.h"

 int32_t qwAbortPrerocessQuery(QW_FPARAMS_DEF);
-int32_t qwPrerocessQuery(QW_FPARAMS_DEF, SQWMsg *qwMsg);
+int32_t qwPreprocessQuery(QW_FPARAMS_DEF, SQWMsg *qwMsg);
 int32_t qwProcessQuery(QW_FPARAMS_DEF, SQWMsg *qwMsg, const char* sql);
 int32_t qwProcessCQuery(QW_FPARAMS_DEF, SQWMsg *qwMsg);
 int32_t qwProcessReady(QW_FPARAMS_DEF, SQWMsg *qwMsg);

--- a/source/libs/qworker/src/qwDbg.c
+++ b/source/libs/qworker/src/qwDbg.c
@@ -9,7 +9,7 @@
 #include "tmsg.h"
 #include "tname.h"

-SQWDebug     gQWDebug = {.statusEnable = true, .dumpEnable = false, .tmp = true};
+SQWDebug     gQWDebug = {.statusEnable = true, .dumpEnable = false, .tmp = false};

 int32_t qwDbgValidateStatus(QW_FPARAMS_DEF, int8_t oriStatus, int8_t newStatus, bool *ignore) {
  if (!gQWDebug.statusEnable) {
@@ -147,9 +147,9 @@ int32_t qwDbgBuildAndSendRedirectRsp(int32_t rspType, SRpcHandleInfo *pConn, int
  return TSDB_CODE_SUCCESS;
 }

-int32_t qwDbgResponseREdirect(SQWMsg *qwMsg, SQWTaskCtx *ctx) {
+int32_t qwDbgResponseRedirect(SQWMsg *qwMsg, SQWTaskCtx *ctx) {
  if (gQWDebug.tmp) {
-    if (TDMT_SCH_QUERY == qwMsg->msgType) {
+    if (TDMT_SCH_QUERY == qwMsg->msgType && (0 == taosRand() % 3)) {
      SEpSet epSet = {0};
      epSet.inUse = 1;
      epSet.numOfEps = 3;
@@ -159,16 +159,15 @@ int32_t qwDbgResponseREdirect(SQWMsg *qwMsg, SQWTaskCtx *ctx) {
      epSet.eps[1].port = 7200;
      strcpy(epSet.eps[2].fqdn, "localhost");
      epSet.eps[2].port = 7300;
-      
+
+      ctx->phase = QW_PHASE_POST_QUERY;
      qwDbgBuildAndSendRedirectRsp(qwMsg->msgType + 1, &qwMsg->connInfo, TSDB_CODE_RPC_REDIRECT, &epSet);
-      gQWDebug.tmp = false;
      return TSDB_CODE_SUCCESS;
    }
    
-    if (TDMT_SCH_MERGE_QUERY == qwMsg->msgType) {
+    if (TDMT_SCH_MERGE_QUERY == qwMsg->msgType && (0 == taosRand() % 3)) {
      ctx->phase = QW_PHASE_POST_QUERY;
      qwDbgBuildAndSendRedirectRsp(qwMsg->msgType + 1, &qwMsg->connInfo, TSDB_CODE_RPC_REDIRECT, NULL);
-      gQWDebug.tmp = false;
      return TSDB_CODE_SUCCESS;
    }
  }

--- a/source/libs/qworker/src/qwMsg.c
+++ b/source/libs/qworker/src/qwMsg.c
@@ -315,10 +315,10 @@ int32_t qWorkerPreprocessQueryMsg(void *qWorkerMgmt, SRpcMsg *pMsg) {
  int64_t  rId = msg->refId;
  int32_t  eId = msg->execId;

-  SQWMsg qwMsg = {.msg = msg->msg + msg->sqlLen, .msgLen = msg->phyLen, .connInfo = pMsg->info};
+  SQWMsg qwMsg = {.msgType = pMsg->msgType, .msg = msg->msg + msg->sqlLen, .msgLen = msg->phyLen, .connInfo = pMsg->info};

  QW_SCH_TASK_DLOG("prerocessQuery start, handle:%p", pMsg->info.handle);
-  QW_ERR_RET(qwPrerocessQuery(QW_FPARAMS(), &qwMsg));
+  QW_ERR_RET(qwPreprocessQuery(QW_FPARAMS(), &qwMsg));
  QW_SCH_TASK_DLOG("prerocessQuery end, handle:%p", pMsg->info.handle);

  return TSDB_CODE_SUCCESS;

--- a/source/libs/qworker/src/qworker.c
+++ b/source/libs/qworker/src/qworker.c
@@ -469,7 +469,7 @@ int32_t qwAbortPrerocessQuery(QW_FPARAMS_DEF) {
 }


-int32_t qwPrerocessQuery(QW_FPARAMS_DEF, SQWMsg *qwMsg) {
+int32_t qwPreprocessQuery(QW_FPARAMS_DEF, SQWMsg *qwMsg) {
  int32_t        code = 0;
  bool           queryRsped = false;
  SSubplan      *plan = NULL;
@@ -488,6 +488,8 @@ int32_t qwPrerocessQuery(QW_FPARAMS_DEF, SQWMsg *qwMsg) {

  QW_ERR_JRET(qwAddTaskStatus(QW_FPARAMS(), JOB_TASK_STATUS_INIT));

+  qwDbgResponseRedirect(qwMsg, ctx);
+
 _return:

  if (ctx) {

--- a/source/libs/scalar/src/sclfunc.c
+++ b/source/libs/scalar/src/sclfunc.c
@@ -2139,3 +2139,171 @@ int32_t stddevScalarFunction(SScalarParam *pInput, int32_t inputNum, SScalarPara
  return TSDB_CODE_SUCCESS;
 }

+#define LEASTSQR_CAL(p, x, y, index, step) \
+  do {                                     \
+    (p)[0][0] += (double)(x) * (x);        \
+    (p)[0][1] += (double)(x);              \
+    (p)[0][2] += (double)(x) * (y)[index]; \
+    (p)[1][2] += (y)[index];               \
+    (x) += step;                           \
+  } while (0)
+
+int32_t leastSQRScalarFunction(SScalarParam *pInput, int32_t inputNum, SScalarParam *pOutput) {
+  SColumnInfoData *pInputData  = pInput->columnData;
+  SColumnInfoData *pOutputData = pOutput->columnData;
+
+  double startVal, stepVal;
+  double matrix[2][3] = {0};
+  GET_TYPED_DATA(startVal, double, GET_PARAM_TYPE(&pInput[1]), pInput[1].columnData->pData);
+  GET_TYPED_DATA(stepVal, double, GET_PARAM_TYPE(&pInput[2]), pInput[2].columnData->pData);
+
+  int32_t type = GET_PARAM_TYPE(pInput);
+  int64_t count = 0;
+
+  switch(type) {
+    case TSDB_DATA_TYPE_TINYINT: {
+      int8_t *in   = (int8_t *)pInputData->pData;
+      for (int32_t i = 0; i < pInput->numOfRows; ++i) {
+        if (colDataIsNull_s(pInputData, i)) {
+          continue;
+        }
+
+        count++;
+        LEASTSQR_CAL(matrix, startVal, in, i, stepVal);
+      }
+      break;
+    }
+    case TSDB_DATA_TYPE_SMALLINT: {
+      int16_t *in  = (int16_t *)pInputData->pData;
+      for (int32_t i = 0; i < pInput->numOfRows; ++i) {
+        if (colDataIsNull_s(pInputData, i)) {
+          continue;
+        }
+
+        count++;
+        LEASTSQR_CAL(matrix, startVal, in, i, stepVal);
+      }
+      break;
+    }
+    case TSDB_DATA_TYPE_INT: {
+      int32_t *in  = (int32_t *)pInputData->pData;
+      for (int32_t i = 0; i < pInput->numOfRows; ++i) {
+        if (colDataIsNull_s(pInputData, i)) {
+          continue;
+        }
+
+        count++;
+        LEASTSQR_CAL(matrix, startVal, in, i, stepVal);
+      }
+      break;
+    }
+    case TSDB_DATA_TYPE_BIGINT: {
+      int64_t *in  = (int64_t *)pInputData->pData;
+      for (int32_t i = 0; i < pInput->numOfRows; ++i) {
+        if (colDataIsNull_s(pInputData, i)) {
+          continue;
+        }
+
+        count++;
+        LEASTSQR_CAL(matrix, startVal, in, i, stepVal);
+      }
+      break;
+    }
+    case TSDB_DATA_TYPE_UTINYINT: {
+      uint8_t *in   = (uint8_t *)pInputData->pData;
+      for (int32_t i = 0; i < pInput->numOfRows; ++i) {
+        if (colDataIsNull_s(pInputData, i)) {
+          continue;
+        }
+
+        count++;
+        LEASTSQR_CAL(matrix, startVal, in, i, stepVal);
+      }
+      break;
+    }
+    case TSDB_DATA_TYPE_USMALLINT: {
+      uint16_t *in  = (uint16_t *)pInputData->pData;
+      for (int32_t i = 0; i < pInput->numOfRows; ++i) {
+        if (colDataIsNull_s(pInputData, i)) {
+          continue;
+        }
+
+        count++;
+        LEASTSQR_CAL(matrix, startVal, in, i, stepVal);
+      }
+      break;
+    }
+    case TSDB_DATA_TYPE_UINT: {
+      uint32_t *in  = (uint32_t *)pInputData->pData;
+      for (int32_t i = 0; i < pInput->numOfRows; ++i) {
+        if (colDataIsNull_s(pInputData, i)) {
+          continue;
+        }
+
+        count++;
+        LEASTSQR_CAL(matrix, startVal, in, i, stepVal);
+      }
+      break;
+    }
+    case TSDB_DATA_TYPE_UBIGINT: {
+      uint64_t *in  = (uint64_t *)pInputData->pData;
+      for (int32_t i = 0; i < pInput->numOfRows; ++i) {
+        if (colDataIsNull_s(pInputData, i)) {
+          continue;
+        }
+
+        count++;
+        LEASTSQR_CAL(matrix, startVal, in, i, stepVal);
+      }
+      break;
+    }
+    case TSDB_DATA_TYPE_FLOAT: {
+      float *in  = (float *)pInputData->pData;
+      for (int32_t i = 0; i < pInput->numOfRows; ++i) {
+        if (colDataIsNull_s(pInputData, i)) {
+          continue;
+        }
+
+        count++;
+        LEASTSQR_CAL(matrix, startVal, in, i, stepVal);
+      }
+      break;
+    }
+    case TSDB_DATA_TYPE_DOUBLE: {
+      double *in  = (double *)pInputData->pData;
+      for (int32_t i = 0; i < pInput->numOfRows; ++i) {
+        if (colDataIsNull_s(pInputData, i)) {
+          continue;
+        }
+
+        count++;
+        LEASTSQR_CAL(matrix, startVal, in, i, stepVal);
+      }
+      break;
+    }
+  }
+
+  if (count == 0) {
+    colDataAppendNULL(pOutputData, 0);
+  } else {
+    matrix[1][1] = (double)count;
+    matrix[1][0] = matrix[0][1];
+
+    double matrix00 = matrix[0][0] - matrix[1][0] * (matrix[0][1] / matrix[1][1]);
+    double matrix02 = matrix[0][2] - matrix[1][2] * (matrix[0][1] / matrix[1][1]);
+    double matrix12 = matrix[1][2] - matrix02 * (matrix[1][0] / matrix00);
+    matrix02 /= matrix00;
+
+    matrix12 /= matrix[1][1];
+
+    char   buf[64] = {0};
+    size_t len =
+        snprintf(varDataVal(buf), sizeof(buf) - VARSTR_HEADER_SIZE, "{slop:%.6lf, intercept:%.6lf}", matrix02, matrix12);
+    varDataSetLen(buf, len);
+    colDataAppend(pOutputData, 0, buf, false);
+
+  }
+
+  pOutput->numOfRows = 1;
+  return TSDB_CODE_SUCCESS;
+}
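
Reviewer note: the matrix arithmetic in `leastSQRScalarFunction` above is the closed-form solution of the normal equations for a line fitted to points whose x values are `start, start + step, ...`. Below is a minimal standalone sketch of the same computation on a plain `double` array. `LineFit` and `fit_line` are hypothetical names used for illustration only, and NULL handling and type dispatch are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: y = slope * x + intercept. */
typedef struct {
  double slope;
  double intercept;
} LineFit;

/* Fits a line to y[0..n-1] where x_i = start + i * step. */
static LineFit fit_line(const double *y, size_t n, double start, double step) {
  assert(y != NULL && n > 0);
  double sxx = 0.0, sx = 0.0, sxy = 0.0, sy = 0.0;
  double x = start;
  for (size_t i = 0; i < n; ++i) {
    sxx += x * x;    /* matrix[0][0] in the diff */
    sx += x;         /* matrix[0][1] */
    sxy += x * y[i]; /* matrix[0][2] */
    sy += y[i];      /* matrix[1][2] */
    x += step;
  }
  /* Zero when n == 1 or step == 0; the caller must avoid those inputs. */
  double denom = sxx - sx * sx / (double)n;
  LineFit fit;
  fit.slope = (sxy - sx * sy / (double)n) / denom;
  fit.intercept = (sy - fit.slope * sx) / (double)n;
  return fit;
}
```

For the points (1,3), (2,5), (3,7), (4,9) this gives slope 2 and intercept 1. With a single non-NULL row or a zero step the denominator is zero, so a reviewer may want to check how the function above is expected to behave for such input.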
--- a/source/libs/scheduler/inc/schInt.h
+++ b/source/libs/scheduler/inc/schInt.h
@@ -28,15 +28,6 @@ extern "C" {
 #include "trpc.h"
 #include "command.h"

-#define SCHEDULE_DEFAULT_MAX_JOB_NUM 1000
-#define SCHEDULE_DEFAULT_MAX_TASK_NUM 1000
-#define SCHEDULE_DEFAULT_MAX_NODE_TABLE_NUM 200  // unit is TSDB_TABLE_NUM_UNIT
-
-#define SCH_DEFAULT_TASK_TIMEOUT_USEC 10000000
-#define SCH_MAX_TASK_TIMEOUT_USEC 60000000
-
-#define SCH_MAX_CANDIDATE_EP_NUM TSDB_MAX_REPLICA
-
 enum {
  SCH_READ = 1,
  SCH_WRITE,
@@ -54,6 +45,24 @@ typedef enum {
  SCH_OP_GET_STATUS,
 } SCH_OP_TYPE;

+typedef enum {
+  SCH_LOAD_SEQ = 1,
+  SCH_RANDOM,
+  SCH_ALL,
+} SCH_POLICY;
+
+#define SCHEDULE_DEFAULT_MAX_JOB_NUM 1000
+#define SCHEDULE_DEFAULT_MAX_TASK_NUM 1000
+#define SCHEDULE_DEFAULT_MAX_NODE_TABLE_NUM 200  // unit is TSDB_TABLE_NUM_UNIT
+#define SCHEDULE_DEFAULT_POLICY SCH_LOAD_SEQ
+
+#define SCH_DEFAULT_TASK_TIMEOUT_USEC 10000000
+#define SCH_MAX_TASK_TIMEOUT_USEC 60000000
+#define SCH_MAX_CANDIDATE_EP_NUM TSDB_MAX_REPLICA
+
+
+
+
 typedef struct SSchDebug {
  bool     lockEnable;
  bool     apiEnable;
@@ -126,6 +135,13 @@ typedef struct SSchStatusFps {
  schStatusEventFp eventFp;
 } SSchStatusFps;

+typedef struct SSchedulerCfg {
+  uint32_t   maxJobNum;
+  int32_t    maxNodeTableNum;
+  SCH_POLICY schPolicy;
+  bool       enableReSchedule;
+} SSchedulerCfg;
+
 typedef struct SSchedulerMgmt {
  uint64_t        taskId; // sequential taksId
  uint64_t        sId;    // schedulerId
@@ -184,34 +200,36 @@ typedef struct SSchLevel {

 typedef struct SSchTaskProfile {
  int64_t  startTs;
-  int64_t* execTime;
+  SArray*  execTime;
  int64_t  waitTime;
  int64_t  endTs;
 } SSchTaskProfile;

 typedef struct SSchTask {
-  uint64_t             taskId;         // task id
-  SRWLatch             lock;           // task reentrant lock
-  int32_t              maxExecTimes;   // task may exec times
-  int32_t              execId;        // task current execute try index
-  SSchLevel           *level;          // level
-  SRWLatch             planLock;       // task update plan lock
-  SSubplan            *plan;           // subplan
-  char                *msg;            // operator tree
-  int32_t              msgLen;         // msg length
-  int8_t               status;         // task status
-  int32_t              lastMsgType;    // last sent msg type
-  int64_t              timeoutUsec;    // taks timeout useconds before reschedule
-  SQueryNodeAddr       succeedAddr;    // task executed success node address
-  int8_t               candidateIdx;   // current try condidation index
-  SArray              *candidateAddrs; // condidate node addresses, element is SQueryNodeAddr
-  SHashObj            *execNodes;      // all tried node for current task, element is SSchNodeInfo
-  SSchTaskProfile      profile;        // task execution profile
-  int32_t              childReady;     // child task ready number
-  SArray              *children;       // the datasource tasks,from which to fetch the result, element is SQueryTask*
-  SArray              *parents;        // the data destination tasks, get data from current task, element is SQueryTask*
-  void*                handle;         // task send handle 
-  bool                 registerdHb;    // registered in hb
+  uint64_t             taskId;          // task id
+  SRWLatch             lock;            // task reentrant lock
+  int32_t              maxExecTimes;    // task max exec times
+  int32_t              maxRetryTimes;   // task max retry times
+  int32_t              retryTimes;      // task retry times
+  int32_t              execId;          // task current execute index
+  SSchLevel           *level;           // level
+  SRWLatch             planLock;        // task update plan lock
+  SSubplan            *plan;            // subplan
+  char                *msg;             // operator tree
+  int32_t              msgLen;          // msg length
+  int8_t               status;          // task status
+  int32_t              lastMsgType;     // last sent msg type
+  int64_t              timeoutUsec;     // task timeout useconds before reschedule
+  SQueryNodeAddr       succeedAddr;     // task executed success node address
+  int8_t               candidateIdx;    // index of the candidate currently tried
+  SArray              *candidateAddrs;  // candidate node addresses, element is SQueryNodeAddr
+  SHashObj            *execNodes;       // all tried node for current task, element is SSchNodeInfo
+  SSchTaskProfile      profile;         // task execution profile
+  int32_t              childReady;      // child task ready number
+  SArray              *children;        // the data source tasks from which to fetch the result, element is SQueryTask*
+  SArray              *parents;         // the data destination tasks, which get data from the current task, element is SQueryTask*
+  void*                handle;          // task send handle
+  bool                 registerdHb;     // registered in hb
 } SSchTask;

 typedef struct SSchJobAttr {
@@ -265,7 +283,7 @@ typedef struct SSchJob {

 extern SSchedulerMgmt schMgmt;

-#define SCH_TASK_TIMEOUT(_task) ((taosGetTimestampUs() - (_task)->profile.execTime[(_task)->execId % (_task)->maxExecTimes]) > (_task)->timeoutUsec)
+#define SCH_TASK_TIMEOUT(_task) ((taosGetTimestampUs() - *(int64_t*)taosArrayGet((_task)->profile.execTime, (_task)->execId)) > (_task)->timeoutUsec)

 #define SCH_TASK_READY_FOR_LAUNCH(readyNum, task) ((readyNum) >= taosArrayGetSize((task)->children))

@@ -299,7 +317,6 @@ extern SSchedulerMgmt schMgmt;
 #define SCH_TASK_NEED_FLOW_CTRL(_job, _task) (SCH_IS_DATA_BIND_QRY_TASK(_task) && SCH_JOB_NEED_FLOW_CTRL(_job) && SCH_IS_LEVEL_UNFINISHED((_task)->level))
 #define SCH_FETCH_TYPE(_pSrcTask) (SCH_IS_DATA_BIND_QRY_TASK(_pSrcTask) ? TDMT_SCH_FETCH : TDMT_SCH_MERGE_FETCH)
 #define SCH_TASK_NEED_FETCH(_task) ((_task)->plan->subplanType != SUBPLAN_TYPE_MODIFY)
-#define SCH_TASK_MAX_EXEC_TIMES(_levelIdx, _levelNum) (SCH_MAX_CANDIDATE_EP_NUM * ((_levelNum) - (_levelIdx)))

 #define SCH_SET_JOB_TYPE(_job, type) do { if ((type) != SUBPLAN_TYPE_MODIFY) { (_job)->attr.queryJob = true; } } while (0)
 #define SCH_IS_QUERY_JOB(_job) ((_job)->attr.queryJob) 
@@ -321,8 +338,7 @@ extern SSchedulerMgmt schMgmt;
 #define SCH_LOG_TASK_START_TS(_task)                          \
  do {                                                        \
    int64_t us = taosGetTimestampUs();                        \
-    int32_t idx = (_task)->execId % (_task)->maxExecTimes; \
-    (_task)->profile.execTime[idx] = us;                    \
+    taosArrayPush((_task)->profile.execTime, &us);           \
    if (0 == (_task)->execId) {                              \
      (_task)->profile.startTs = us;                          \
    }                                                         \
@@ -331,8 +347,7 @@ extern SSchedulerMgmt schMgmt;
 #define SCH_LOG_TASK_WAIT_TS(_task)                        \
  do {                                                    \
    int64_t us = taosGetTimestampUs();                    \
-    int32_t idx = (_task)->execId % (_task)->maxExecTimes; \
-    (_task)->profile.waitTime += us - (_task)->profile.execTime[idx];    \
+    (_task)->profile.waitTime += us - *(int64_t*)taosArrayGet((_task)->profile.execTime, (_task)->execId);    \
  } while (0)  


@@ -340,7 +355,8 @@ extern SSchedulerMgmt schMgmt;
  do {                                                    \
    int64_t us = taosGetTimestampUs();                    \
    int32_t idx = (_task)->execId % (_task)->maxExecTimes; \
-    (_task)->profile.execTime[idx] = us - (_task)->profile.execTime[idx];    \
+    int64_t *startts = taosArrayGet((_task)->profile.execTime, (_task)->execId); \
+    *startts = us - *startts;                        \
    (_task)->profile.endTs = us;                          \
  } while (0)  

@@ -471,9 +487,11 @@ void    schFreeTask(SSchJob *pJob, SSchTask *pTask);
 void    schDropTaskInHashList(SSchJob *pJob, SHashObj *list);
 int32_t schLaunchLevelTasks(SSchJob *pJob, SSchLevel *level);
 int32_t schGetTaskFromList(SHashObj *pTaskList, uint64_t taskId, SSchTask **pTask);
-int32_t schInitTask(SSchJob *pJob, SSchTask *pTask, SSubplan *pPlan, SSchLevel *pLevel, int32_t levelNum);
+int32_t schInitTask(SSchJob *pJob, SSchTask *pTask, SSubplan *pPlan, SSchLevel *pLevel);
 int32_t schSwitchTaskCandidateAddr(SSchJob *pJob, SSchTask *pTask);
 void    schDirectPostJobRes(SSchedulerReq* pReq, int32_t errCode);
+int32_t schHandleJobFailure(SSchJob *pJob, int32_t errCode);
+int32_t schHandleJobDrop(SSchJob *pJob, int32_t errCode);
 bool    schChkCurrentOp(SSchJob *pJob, int32_t op, bool sync);

 extern SSchDebug gSCHDebug;
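
Reviewer note: `SSchTask` now carries two limits, `maxExecTimes` for the whole task lifetime and `maxRetryTimes` for launches since the retry counter was last reset. Below is a minimal sketch of how the two counters interact, assuming the limits are checked the way the `schTask.c` hunks of this diff check them. `RetryBudget` and the `budget_*` helpers are hypothetical names, not scheduler APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the retry-related fields in SSchTask. */
typedef struct {
  int32_t maxExecTimes;  /* launches allowed over the task's lifetime */
  int32_t maxRetryTimes; /* launches allowed since the last reset */
  int32_t retryTimes;    /* launches since the last reset */
  int32_t execId;        /* index of the latest launch, -1 before the first */
} RetryBudget;

static void budget_init(RetryBudget *b, int32_t maxRetryTimes, int32_t level) {
  assert(b != 0 && maxRetryTimes > 0 && level >= 0);
  b->maxRetryTimes = maxRetryTimes;
  b->maxExecTimes = maxRetryTimes * (level + 1);
  b->retryTimes = 0;
  b->execId = -1;
}

/* True if one more launch stays within both limits. */
static bool budget_can_launch(const RetryBudget *b) {
  if ((b->execId + 1) >= b->maxExecTimes) return false;
  if ((b->retryTimes + 1) > b->maxRetryTimes) return false;
  return true;
}

/* Records one launch against both counters. */
static void budget_on_launch(RetryBudget *b) {
  b->execId++;
  b->retryTimes++;
}

/* Resets the per-round counter; the lifetime counter keeps growing. */
static void budget_reset_retries(RetryBudget *b) { b->retryTimes = 0; }
```

With `maxRetryTimes` of 3 at level 1, the sketch allows three launches, three more after a reset, and none after a second reset, because the lifetime limit of 6 has been reached.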

--- a/source/libs/scheduler/src/schJob.c
+++ b/source/libs/scheduler/src/schJob.c
@@ -343,7 +343,7 @@ int32_t schValidateAndBuildJob(SQueryPlan *pDag, SSchJob *pJob) {
        SCH_ERR_JRET(TSDB_CODE_QRY_OUT_OF_MEMORY);
      }

-      SCH_ERR_JRET(schInitTask(pJob, pTask, plan, pLevel, levelNum));
+      SCH_ERR_JRET(schInitTask(pJob, pTask, plan, pLevel));

      SCH_ERR_JRET(schAppendJobDataSrc(pJob, pTask));

@@ -476,7 +476,7 @@ _return:
  SCH_UNLOCK(SCH_WRITE, &pJob->opStatus.lock);
 }

-int32_t schProcessOnJobFailureImpl(SSchJob *pJob, int32_t status, int32_t errCode) {
+int32_t schProcessOnJobFailure(SSchJob *pJob, int32_t errCode) {
  schUpdateJobErrCode(pJob, errCode);
  
  int32_t code = atomic_load_32(&pJob->errCode);
@@ -489,21 +489,29 @@ int32_t schProcessOnJobFailureImpl(SSchJob *pJob, int32_t status, int32_t errCod
  SCH_RET(TSDB_CODE_SCH_IGNORE_ERROR);
 }

-// Note: no more task error processing, handled in function internal
-int32_t schProcessOnJobFailure(SSchJob *pJob, int32_t errCode) {
+int32_t schHandleJobFailure(SSchJob *pJob, int32_t errCode) {
  if (TSDB_CODE_SCH_IGNORE_ERROR == errCode) {
    return TSDB_CODE_SCH_IGNORE_ERROR;
  }

-  schProcessOnJobFailureImpl(pJob, JOB_TASK_STATUS_FAIL, errCode);
+  schSwitchJobStatus(pJob, JOB_TASK_STATUS_FAIL, &errCode);
  return TSDB_CODE_SCH_IGNORE_ERROR;
 }

-// Note: no more error processing, handled in function internal
 int32_t schProcessOnJobDropped(SSchJob *pJob, int32_t errCode) {
-  SCH_RET(schProcessOnJobFailureImpl(pJob, JOB_TASK_STATUS_DROP, errCode));
+  SCH_RET(schProcessOnJobFailure(pJob, errCode));
+}
+
+int32_t schHandleJobDrop(SSchJob *pJob, int32_t errCode) {
+  if (TSDB_CODE_SCH_IGNORE_ERROR == errCode) {
+    return TSDB_CODE_SCH_IGNORE_ERROR;
+  }
+
+  schSwitchJobStatus(pJob, JOB_TASK_STATUS_DROP, &errCode);
+  return TSDB_CODE_SCH_IGNORE_ERROR;
 }

+
 int32_t schProcessOnJobPartialSuccess(SSchJob *pJob) {
  schPostJobRes(pJob, SCH_OP_EXEC);

@@ -828,7 +836,7 @@ void schProcessOnOpEnd(SSchJob *pJob, SCH_OP_TYPE type, SSchedulerReq* pReq, int
  }

  if (errCode) {
-    schSwitchJobStatus(pJob, JOB_TASK_STATUS_FAIL, (void*)&errCode);
+    schHandleJobFailure(pJob, errCode);
  }

  SCH_JOB_DLOG("job end %s operation with code %s", schGetOpStr(type), tstrerror(errCode));
@@ -907,7 +915,7 @@ void schProcessOnCbEnd(SSchJob *pJob, SSchTask *pTask, int32_t errCode) {
  }

  if (errCode) {
-    schSwitchJobStatus(pJob, JOB_TASK_STATUS_FAIL, (void*)&errCode);
+    schHandleJobFailure(pJob, errCode);
  }
  
  if (pJob) {

--- a/source/libs/scheduler/src/schTask.c
+++ b/source/libs/scheduler/src/schTask.c
@@ -42,32 +42,47 @@ void schFreeTask(SSchJob *pJob, SSchTask *pTask) {
    taosHashCleanup(pTask->execNodes);
  }

-  taosMemoryFree(pTask->profile.execTime);
+  taosArrayDestroy(pTask->profile.execTime);
 }

-int32_t schInitTask(SSchJob *pJob, SSchTask *pTask, SSubplan *pPlan, SSchLevel *pLevel, int32_t levelNum) {
+void schInitTaskRetryTimes(SSchJob *pJob, SSchTask *pTask, SSchLevel *pLevel) {
+  if (SCH_IS_DATA_BIND_TASK(pTask) || (!SCH_IS_QUERY_JOB(pJob)) || (SCH_ALL != schMgmt.cfg.schPolicy)) {
+    pTask->maxRetryTimes = SCH_MAX_CANDIDATE_EP_NUM;
+  } else {
+    int32_t nodeNum = taosArrayGetSize(pJob->nodeList);
+    pTask->maxRetryTimes = TMAX(nodeNum, SCH_MAX_CANDIDATE_EP_NUM);
+  }
+  
+  pTask->maxExecTimes = pTask->maxRetryTimes * (pLevel->level + 1);
+}
+
+int32_t schInitTask(SSchJob *pJob, SSchTask *pTask, SSubplan *pPlan, SSchLevel *pLevel) {
  int32_t code = 0;

  pTask->plan = pPlan;
  pTask->level = pLevel;
  pTask->execId = -1;
-  pTask->maxExecTimes = SCH_TASK_MAX_EXEC_TIMES(pLevel->level, levelNum);
  pTask->timeoutUsec = SCH_DEFAULT_TASK_TIMEOUT_USEC;
  pTask->taskId = schGenTaskId();
  pTask->execNodes =
      taosHashInit(SCH_MAX_CANDIDATE_EP_NUM, taosGetDefaultHashFunction(TSDB_DATA_TYPE_INT), true, HASH_NO_LOCK);
-  pTask->profile.execTime = taosMemoryCalloc(pTask->maxExecTimes, sizeof(int64_t));
+
+  schInitTaskRetryTimes(pJob, pTask, pLevel);
+
+  pTask->profile.execTime = taosArrayInit(pTask->maxExecTimes, sizeof(int64_t));
  if (NULL == pTask->execNodes || NULL == pTask->profile.execTime) {
    SCH_ERR_JRET(TSDB_CODE_QRY_OUT_OF_MEMORY);
  }

  SCH_SET_TASK_STATUS(pTask, JOB_TASK_STATUS_INIT);

+  SCH_TASK_DLOG("task initialized, max times %d:%d", pTask->maxRetryTimes, pTask->maxExecTimes);
+
  return TSDB_CODE_SUCCESS;

 _return:

-  taosMemoryFreeClear(pTask->profile.execTime);
+  taosArrayDestroy(pTask->profile.execTime);
  taosHashCleanup(pTask->execNodes);

  SCH_RET(code);
@@ -105,7 +120,7 @@ int32_t schDropTaskExecNode(SSchJob *pJob, SSchTask *pTask, void *handle, int32_
  }

  if (taosHashRemove(pTask->execNodes, &execId, sizeof(execId))) {
-    SCH_TASK_ELOG("fail to remove execId %d from execNodeList", execId);
+    SCH_TASK_DLOG("execId %d already not in execNodeList", execId);
  } else {
    SCH_TASK_DLOG("execId %d removed from execNodeList", execId);
  }
@@ -235,7 +250,7 @@ int32_t schProcessOnTaskSuccess(SSchJob *pJob, SSchTask *pTask) {
      }

      if (pTask->level->taskFailed > 0) {
-        SCH_RET(schSwitchJobStatus(pJob, JOB_TASK_STATUS_FAIL, NULL));
+        SCH_RET(schHandleJobFailure(pJob, pJob->errCode));
      } else {
        SCH_RET(schSwitchJobStatus(pJob, JOB_TASK_STATUS_PART_SUCC, NULL));
      }
@@ -285,6 +300,10 @@ int32_t schProcessOnTaskSuccess(SSchJob *pJob, SSchTask *pTask) {
 }

 int32_t schRescheduleTask(SSchJob *pJob, SSchTask *pTask) {
+  if (!schMgmt.cfg.enableReSchedule) {
+    return TSDB_CODE_SUCCESS;
+  }
+  
  if (SCH_IS_DATA_BIND_TASK(pTask)) {
    return TSDB_CODE_SUCCESS;
  }
@@ -304,13 +323,17 @@ int32_t schRescheduleTask(SSchJob *pJob, SSchTask *pTask) {
 int32_t schDoTaskRedirect(SSchJob *pJob, SSchTask *pTask, SDataBuf *pData, int32_t rspCode) {
  int32_t code = 0;

-  if ((pTask->execId + 1) >= pTask->maxExecTimes) {
-    SCH_TASK_DLOG("task no more retry since reach max try times, execId:%d", pTask->execId);
-    schSwitchJobStatus(pJob, JOB_TASK_STATUS_FAIL, (void *)&rspCode);
-    return TSDB_CODE_SUCCESS;
+  SCH_TASK_DLOG("task will be redirected now, status:%s", SCH_GET_TASK_STATUS_STR(pTask));
+
+  if (NULL == pData) {
+    pTask->retryTimes = 0;
  }

-  SCH_TASK_DLOG("task will be redirected now, status:%s", SCH_GET_TASK_STATUS_STR(pTask));
+  if (((pTask->execId + 1) >= pTask->maxExecTimes) || ((pTask->retryTimes + 1) > pTask->maxRetryTimes)) {
+    SCH_TASK_DLOG("task will not be retried, max times %d:%d reached, execId %d", pTask->maxRetryTimes, pTask->maxExecTimes, pTask->execId);
+    schHandleJobFailure(pJob, rspCode);
+    return TSDB_CODE_SUCCESS;
+  }

  schDropTaskOnExecNode(pJob, pTask);
  taosHashClear(pTask->execNodes);
@@ -493,9 +516,15 @@ int32_t schTaskCheckSetRetry(SSchJob *pJob, SSchTask *pTask, int32_t errCode, bo
    }
  }

+  if ((pTask->retryTimes + 1) > pTask->maxRetryTimes) {
+    *needRetry = false;
+    SCH_TASK_DLOG("task will not be retried, max retry times reached, retryTimes:%d/%d", pTask->retryTimes, pTask->maxRetryTimes);
+    return TSDB_CODE_SUCCESS;
+  }
+
  if ((pTask->execId + 1) >= pTask->maxExecTimes) {
    *needRetry = false;
-    SCH_TASK_DLOG("task no more retry since reach max try times, execId:%d", pTask->execId);
+    SCH_TASK_DLOG("task will not be retried, max exec times reached, execId:%d/%d", pTask->execId, pTask->maxExecTimes);
    return TSDB_CODE_SUCCESS;
  }

@@ -649,10 +678,31 @@ int32_t schUpdateTaskCandidateAddr(SSchJob *pJob, SSchTask *pTask, SEpSet *pEpSe

 int32_t schSwitchTaskCandidateAddr(SSchJob *pJob, SSchTask *pTask) {
  int32_t candidateNum = taosArrayGetSize(pTask->candidateAddrs);
-  if (++pTask->candidateIdx >= candidateNum) {
-    pTask->candidateIdx = 0;
+  if (candidateNum <= 1) {
+    goto _return;
+  }
+  
+  switch (schMgmt.cfg.schPolicy) {
+    case SCH_LOAD_SEQ:
+    case SCH_ALL: 
+    default:
+      if (++pTask->candidateIdx >= candidateNum) {
+        pTask->candidateIdx = 0;
+      }
+      break;
+    case SCH_RANDOM: {
+      int32_t lastIdx = pTask->candidateIdx;
+      while (lastIdx == pTask->candidateIdx) {
+        pTask->candidateIdx = taosRand() % candidateNum;
+      }
+      break;
+    }
  }
-  SCH_TASK_DLOG("switch task candiateIdx to %d", pTask->candidateIdx);
+
+_return:
+
+  SCH_TASK_DLOG("switch task candidateIdx to %d/%d", pTask->candidateIdx, candidateNum);
+  
  return TSDB_CODE_SUCCESS;
 }

@@ -739,8 +789,9 @@ int32_t schLaunchTaskImpl(SSchJob *pJob, SSchTask *pTask) {

  atomic_add_fetch_32(&pTask->level->taskLaunchedNum, 1);
  pTask->execId++;
+  pTask->retryTimes++;

-  SCH_TASK_DLOG("start to launch task's %dth exec", pTask->execId);
+  SCH_TASK_DLOG("start to launch task, execId %d, retry %d", pTask->execId, pTask->retryTimes);

  SCH_LOG_TASK_START_TS(pTask);


--- a/source/libs/scheduler/src/scheduler.c
+++ b/source/libs/scheduler/src/scheduler.c
@@ -22,26 +22,19 @@ SSchedulerMgmt schMgmt = {
    .jobRef = -1,
 };

-int32_t schedulerInit(SSchedulerCfg *cfg) {
+int32_t schedulerInit() {
  if (schMgmt.jobRef >= 0) {
    qError("scheduler already initialized");
    return TSDB_CODE_QRY_INVALID_INPUT;
  }

-  if (cfg) {
-    schMgmt.cfg = *cfg;
-
-    if (schMgmt.cfg.maxJobNum == 0) {
-      schMgmt.cfg.maxJobNum = SCHEDULE_DEFAULT_MAX_JOB_NUM;
-    }
-    if (schMgmt.cfg.maxNodeTableNum <= 0) {
-      schMgmt.cfg.maxNodeTableNum = SCHEDULE_DEFAULT_MAX_NODE_TABLE_NUM;
-    }
-  } else {
-    schMgmt.cfg.maxJobNum = SCHEDULE_DEFAULT_MAX_JOB_NUM;
-    schMgmt.cfg.maxNodeTableNum = SCHEDULE_DEFAULT_MAX_NODE_TABLE_NUM;
-  }
+  schMgmt.cfg.maxJobNum = SCHEDULE_DEFAULT_MAX_JOB_NUM;
+  schMgmt.cfg.maxNodeTableNum = SCHEDULE_DEFAULT_MAX_NODE_TABLE_NUM;
+  schMgmt.cfg.schPolicy = SCHEDULE_DEFAULT_POLICY;
+  schMgmt.cfg.enableReSchedule = true;

+  qDebug("schedule policy init to %d", schMgmt.cfg.schPolicy);
+  
  schMgmt.jobRef = taosOpenRef(schMgmt.cfg.maxJobNum, schFreeJobImpl);
  if (schMgmt.jobRef < 0) {
    qError("init schduler jobRef failed, num:%u", schMgmt.cfg.maxJobNum);
@@ -130,6 +123,26 @@ void schedulerStopQueryHb(void *pTrans) {
  schCleanClusterHb(pTrans);
 }

+int32_t schedulerUpdatePolicy(int32_t policy) {
+  switch (policy) {
+    case SCH_LOAD_SEQ:
+    case SCH_RANDOM:
+    case SCH_ALL:
+      schMgmt.cfg.schPolicy = policy;
+      qDebug("schedule policy updated to %d", schMgmt.cfg.schPolicy);
+      break;
+    default:
+      return TSDB_CODE_TSC_INVALID_INPUT;
+  }
+
+  return TSDB_CODE_SUCCESS;
+}
+
+int32_t schedulerEnableReSchedule(bool enableResche) {
+  schMgmt.cfg.enableReSchedule = enableResche;
+  return TSDB_CODE_SUCCESS;
+}
+
 void schedulerFreeJob(int64_t* jobId, int32_t errCode) {
  if (0 == *jobId) {
    return;
@@ -141,7 +154,7 @@ void schedulerFreeJob(int64_t* jobId, int32_t errCode) {
    return;
  }

-  schSwitchJobStatus(pJob, JOB_TASK_STATUS_DROP, (void*)&errCode);
+  schHandleJobDrop(pJob, errCode);
  
  schReleaseJob(*jobId);
  *jobId = 0;

--- a/source/libs/scheduler/test/schedulerTests.cpp
+++ b/source/libs/scheduler/test/schedulerTests.cpp
@@ -477,7 +477,7 @@ void* schtRunJobThread(void *aa) {
  schtInitLogFile();

  
-  int32_t code = schedulerInit(NULL);
+  int32_t code = schedulerInit();
  assert(code == 0);


@@ -649,7 +649,7 @@ TEST(queryTest, normalCase) {
  qnodeAddr.port = 6031;
  taosArrayPush(qnodeList, &qnodeAddr);
  
-  int32_t code = schedulerInit(NULL);
+  int32_t code = schedulerInit();
  ASSERT_EQ(code, 0);

  schtBuildQueryDag(&dag);
@@ -756,7 +756,7 @@ TEST(queryTest, readyFirstCase) {
  qnodeAddr.port = 6031;
  taosArrayPush(qnodeList, &qnodeAddr);
  
-  int32_t code = schedulerInit(NULL);
+  int32_t code = schedulerInit();
  ASSERT_EQ(code, 0);

  schtBuildQueryDag(&dag);
@@ -866,7 +866,7 @@ TEST(queryTest, flowCtrlCase) {
  qnodeAddr.port = 6031;
  taosArrayPush(qnodeList, &qnodeAddr);
  
-  int32_t code = schedulerInit(NULL);
+  int32_t code = schedulerInit();
  ASSERT_EQ(code, 0);

  schtBuildQueryFlowCtrlDag(&dag);
@@ -975,7 +975,7 @@ TEST(insertTest, normalCase) {
  qnodeAddr.port = 6031;
  taosArrayPush(qnodeList, &qnodeAddr);
  
-  int32_t code = schedulerInit(NULL);
+  int32_t code = schedulerInit();
  ASSERT_EQ(code, 0);

  schtBuildInsertDag(&dag);

--- a/source/libs/stream/src/stream.c
+++ b/source/libs/stream/src/stream.c
@@ -57,7 +57,7 @@ void streamTriggerByTimer(void* param, void* tmrId) {
  if (atomic_load_8(&pTask->triggerStatus) == TASK_TRIGGER_STATUS__ACTIVE) {
    SStreamTrigger* trigger = taosAllocateQitem(sizeof(SStreamTrigger), DEF_QITEM);
    if (trigger == NULL) return;
-    trigger->type = STREAM_INPUT__TRIGGER;
+    trigger->type = STREAM_INPUT__GET_RES;
    trigger->pBlock = taosMemoryCalloc(1, sizeof(SSDataBlock));
    if (trigger->pBlock == NULL) {
      taosFreeQitem(trigger);
@@ -183,8 +183,11 @@ int32_t streamProcessDispatchReq(SStreamTask* pTask, SStreamDispatchReq* pReq, S
  // 2.1. idle: exec
  // 2.2. executing: return
  // 2.3. closing: keep trying
+#if 0
  if (pTask->execType != TASK_EXEC__NONE) {
-    streamExec(pTask, pTask->pMsgCb);
+#endif
+  streamExec(pTask, pTask->pMsgCb);
+#if 0
  } else {
    ASSERT(pTask->sinkType != TASK_SINK__NONE);
    while (1) {
@@ -195,11 +198,13 @@ int32_t streamProcessDispatchReq(SStreamTask* pTask, SStreamDispatchReq* pReq, S
      }
    }
  }
+#endif

  // 3. handle output
  // 3.1 check and set status
  // 3.2 dispatch / sink
  if (pTask->dispatchType != TASK_DISPATCH__NONE) {
+    ASSERT(pTask->sinkType == TASK_SINK__NONE);
    streamDispatch(pTask, pTask->pMsgCb);
  }


--- a/source/libs/stream/src/streamData.c
+++ b/source/libs/stream/src/streamData.c
@@ -112,7 +112,7 @@ int32_t streamAppendQueueItem(SStreamQueueItem* dst, SStreamQueueItem* elem) {

 void streamFreeQitem(SStreamQueueItem* data) {
  int8_t type = data->type;
-  if (type == STREAM_INPUT__TRIGGER) {
+  if (type == STREAM_INPUT__GET_RES) {
    blockDataDestroy(((SStreamTrigger*)data)->pBlock);
    taosFreeQitem(data);
  } else if (type == STREAM_INPUT__DATA_BLOCK || type == STREAM_INPUT__DATA_RETRIEVE) {

--- a/source/libs/stream/src/streamExec.c
+++ b/source/libs/stream/src/streamExec.c
@@ -20,7 +20,7 @@ static int32_t streamTaskExecImpl(SStreamTask* pTask, void* data, SArray* pRes)

  // set input
  SStreamQueueItem* pItem = (SStreamQueueItem*)data;
-  if (pItem->type == STREAM_INPUT__TRIGGER) {
+  if (pItem->type == STREAM_INPUT__GET_RES) {
    SStreamTrigger* pTrigger = (SStreamTrigger*)data;
    qSetMultiStreamInput(exec, pTrigger->pBlock, 1, STREAM_INPUT__DATA_BLOCK, false);
  } else if (pItem->type == STREAM_INPUT__DATA_SUBMIT) {
@@ -73,6 +73,15 @@ static int32_t streamTaskExecImpl(SStreamTask* pTask, void* data, SArray* pRes)
  return 0;
 }

+static FORCE_INLINE int32_t streamUpdateVer(SStreamTask* pTask, SStreamDataBlock* pBlock) {
+  ASSERT(pBlock->type == STREAM_INPUT__DATA_BLOCK);
+  int32_t             childId = pBlock->childId;
+  int64_t             ver = pBlock->sourceVer;
+  SStreamChildEpInfo* pChildInfo = taosArrayGetP(pTask->childEpInfo, childId);
+  pChildInfo->processedVer = ver;
+  return 0;
+}
+
 static SArray* streamExecForQall(SStreamTask* pTask, SArray* pRes) {
  int32_t cnt = 0;
  void*   data = NULL;
@@ -84,14 +93,17 @@ static SArray* streamExecForQall(SStreamTask* pTask, SArray* pRes) {
    }
    if (data == NULL) {
      data = qItem;
+      if (qItem->type == STREAM_INPUT__DATA_BLOCK) {
+        /*streamUpdateVer(pTask, (SStreamDataBlock*)qItem);*/
+      }
      streamQueueProcessSuccess(pTask->inputQueue);
-      continue;
    } else {
      if (streamAppendQueueItem(data, qItem) < 0) {
        streamQueueProcessFail(pTask->inputQueue);
        break;
      } else {
        cnt++;
+        /*streamUpdateVer(pTask, (SStreamDataBlock*)qItem);*/
        streamQueueProcessSuccess(pTask->inputQueue);
        taosArrayDestroy(((SStreamDataBlock*)qItem)->blocks);
        taosFreeQitem(qItem);
@@ -106,6 +118,12 @@ static SArray* streamExecForQall(SStreamTask* pTask, SArray* pRes) {

  if (data == NULL) return pRes;

+  if (pTask->execType == TASK_EXEC__NONE) {
+    ASSERT(((SStreamQueueItem*)data)->type == STREAM_INPUT__DATA_BLOCK);
+    streamTaskOutput(pTask, data);
+    return pRes;
+  }
+
  qDebug("stream task %d exec begin, msg batch: %d", pTask->taskId, cnt);
  streamTaskExecImpl(pTask, data, pRes);
  qDebug("stream task %d exec end", pTask->taskId);
@@ -125,6 +143,11 @@ static SArray* streamExecForQall(SStreamTask* pTask, SArray* pRes) {
      taosFreeQitem(qRes);
      return NULL;
    }
+    if (((SStreamQueueItem*)data)->type == STREAM_INPUT__DATA_SUBMIT) {
+      SStreamDataSubmit* pSubmit = (SStreamDataSubmit*)data;
+      qRes->childId = pTask->selfChildId;
+      qRes->sourceVer = pSubmit->ver;
+    }
    /*streamQueueProcessSuccess(pTask->inputQueue);*/
    pRes = taosArrayInit(0, sizeof(SSDataBlock));
  }

--- a/source/libs/stream/src/streamRecover.c
+++ b/source/libs/stream/src/streamRecover.c
+/*
+ * Copyright (c) 2019 TAOS Data, Inc. <jhtao@taosdata.com>
+ *
+ * This program is free software: you can use, redistribute, and/or modify
+ * it under the terms of the GNU Affero General Public License, version 3
+ * or later ("AGPL"), as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * You should have received a copy of the GNU Affero General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "streamInc.h"
+
+int32_t tEncodeStreamTaskRecoverReq(SEncoder* pEncoder, const SStreamTaskRecoverReq* pReq) {
+  if (tStartEncode(pEncoder) < 0) return -1;
+  if (tEncodeI64(pEncoder, pReq->streamId) < 0) return -1;
+  if (tEncodeI32(pEncoder, pReq->taskId) < 0) return -1;
+  if (tEncodeI32(pEncoder, pReq->sourceTaskId) < 0) return -1;
+  if (tEncodeI32(pEncoder, pReq->sourceVg) < 0) return -1;
+  tEndEncode(pEncoder);
+  return pEncoder->pos;
+}
+
+int32_t tDecodeStreamTaskRecoverReq(SDecoder* pDecoder, SStreamTaskRecoverReq* pReq) {
+  if (tStartDecode(pDecoder) < 0) return -1;
+  if (tDecodeI64(pDecoder, &pReq->streamId) < 0) return -1;
+  if (tDecodeI32(pDecoder, &pReq->taskId) < 0) return -1;
+  if (tDecodeI32(pDecoder, &pReq->sourceTaskId) < 0) return -1;
+  if (tDecodeI32(pDecoder, &pReq->sourceVg) < 0) return -1;
+  tEndDecode(pDecoder);
+  return 0;
+}
+
+int32_t tEncodeStreamTaskRecoverRsp(SEncoder* pEncoder, const SStreamTaskRecoverRsp* pRsp) {
+  if (tStartEncode(pEncoder) < 0) return -1;
+  if (tEncodeI64(pEncoder, pRsp->streamId) < 0) return -1;
+  if (tEncodeI32(pEncoder, pRsp->taskId) < 0) return -1;
+  if (tEncodeI8(pEncoder, pRsp->inputStatus) < 0) return -1;
+  tEndEncode(pEncoder);
+  return pEncoder->pos;
+}
+
+int32_t tDecodeStreamTaskRecoverRsp(SDecoder* pDecoder, SStreamTaskRecoverRsp* pRsp) {
+  if (tStartDecode(pDecoder) < 0) return -1;
+  if (tDecodeI64(pDecoder, &pRsp->streamId) < 0) return -1;
+  if (tDecodeI32(pDecoder, &pRsp->taskId) < 0) return -1;
+  if (tDecodeI8(pDecoder, &pRsp->inputStatus) < 0) return -1;
+  tEndDecode(pDecoder);
+  return 0;
+}
+
+int32_t tEncodeSMStreamTaskRecoverReq(SEncoder* pEncoder, const SMStreamTaskRecoverReq* pReq) {
+  if (tStartEncode(pEncoder) < 0) return -1;
+  if (tEncodeI64(pEncoder, pReq->streamId) < 0) return -1;
+  if (tEncodeI32(pEncoder, pReq->taskId) < 0) return -1;
+  tEndEncode(pEncoder);
+  return pEncoder->pos;
+}
+
+int32_t tDecodeSMStreamTaskRecoverReq(SDecoder* pDecoder, SMStreamTaskRecoverReq* pReq) {
+  if (tStartDecode(pDecoder) < 0) return -1;
+  if (tDecodeI64(pDecoder, &pReq->streamId) < 0) return -1;
+  if (tDecodeI32(pDecoder, &pReq->taskId) < 0) return -1;
+  tEndDecode(pDecoder);
+  return 0;
+}
+
+int32_t tEncodeSMStreamTaskRecoverRsp(SEncoder* pEncoder, const SMStreamTaskRecoverRsp* pRsp) {
+  if (tStartEncode(pEncoder) < 0) return -1;
+  if (tEncodeI64(pEncoder, pRsp->streamId) < 0) return -1;
+  if (tEncodeI32(pEncoder, pRsp->taskId) < 0) return -1;
+  tEndEncode(pEncoder);
+  return pEncoder->pos;
+}
+
+int32_t tDecodeSMStreamTaskRecoverRsp(SDecoder* pDecoder, SMStreamTaskRecoverRsp* pRsp) {
+  if (tStartDecode(pDecoder) < 0) return -1;
+  if (tDecodeI64(pDecoder, &pRsp->streamId) < 0) return -1;
+  if (tDecodeI32(pDecoder, &pRsp->taskId) < 0) return -1;
+  tEndDecode(pDecoder);
+  return 0;
+}
+
+int32_t streamProcessFailRecoverReq(SStreamTask* pTask, SMStreamTaskRecoverReq* pReq, SRpcMsg* pRsp) {
+  if (pTask->taskStatus != TASK_STATUS__FAIL) {
+    return 0;
+  }
+
+  if (pTask->isStreamDistributed) {
+    if (pTask->isDataScan) {
+      pTask->taskStatus = TASK_STATUS__PREPARE_RECOVER;
+    } else if (pTask->execType != TASK_EXEC__NONE) {
+      pTask->taskStatus = TASK_STATUS__PREPARE_RECOVER;
+      bool    hasCheckpoint = false;
+      int32_t childSz = taosArrayGetSize(pTask->childEpInfo);
+      for (int32_t i = 0; i < childSz; i++) {
+        SStreamChildEpInfo* pEpInfo = taosArrayGetP(pTask->childEpInfo, i);
+        // assumes -1 means "no checkpoint", as in the non-distributed branch below
+        if (pEpInfo->checkpointVer != -1) {
+          hasCheckpoint = true;
+          break;
+        }
+      }
+      if (hasCheckpoint) {
+        // load from checkpoint
+      } else {
+        // recover child
+      }
+    }
+  } else {
+    if (pTask->isDataScan) {
+      if (pTask->checkpointVer != -1) {
+        // load from checkpoint
+      } else {
+        // reset stream query task info
+        // TODO get snapshot ver
+        pTask->recoverSnapVer = -1;
+        qStreamPrepareRecover(pTask->exec.executor, pTask->startVer, pTask->recoverSnapVer);
+        pTask->taskStatus = TASK_STATUS__RECOVERING;
+      }
+    }
+  }
+
+  if (pTask->taskStatus == TASK_STATUS__RECOVERING) {
+    streamProcessRunReq(pTask);
+  }
+  return 0;
+}
--- a/source/libs/stream/src/streamTask.c
+++ b/source/libs/stream/src/streamTask.c
@@ -34,6 +34,7 @@ int32_t tEncodeStreamEpInfo(SEncoder* pEncoder, const SStreamChildEpInfo* pInfo)
  if (tEncodeI32(pEncoder, pInfo->taskId) < 0) return -1;
  if (tEncodeI32(pEncoder, pInfo->nodeId) < 0) return -1;
  if (tEncodeI32(pEncoder, pInfo->childId) < 0) return -1;
+  if (tEncodeI64(pEncoder, pInfo->processedVer) < 0) return -1;
  if (tEncodeSEpSet(pEncoder, &pInfo->epSet) < 0) return -1;
  return 0;
 }
@@ -42,6 +43,7 @@ int32_t tDecodeStreamEpInfo(SDecoder* pDecoder, SStreamChildEpInfo* pInfo) {
  if (tDecodeI32(pDecoder, &pInfo->taskId) < 0) return -1;
  if (tDecodeI32(pDecoder, &pInfo->nodeId) < 0) return -1;
  if (tDecodeI32(pDecoder, &pInfo->childId) < 0) return -1;
+  if (tDecodeI64(pDecoder, &pInfo->processedVer) < 0) return -1;
  if (tDecodeSEpSet(pDecoder, &pInfo->epSet) < 0) return -1;
  return 0;
 }

--- a/source/libs/sync/inc/syncInt.h
+++ b/source/libs/sync/inc/syncInt.h
@@ -253,6 +253,7 @@ bool syncNodeCheckNewConfig(SSyncNode* pSyncNode, const SSyncCfg* pNewCfg);

 int32_t syncNodeLeaderTransfer(SSyncNode* pSyncNode);
 int32_t syncNodeLeaderTransferTo(SSyncNode* pSyncNode, SNodeInfo newLeader);
+int32_t syncDoLeaderTransfer(SSyncNode* ths, SRpcMsg* pRpcMsg, SSyncRaftEntry* pEntry);

 // for debug --------------
 void syncNodePrint(SSyncNode* pObj);

--- a/source/libs/sync/src/syncAppendEntries.c
+++ b/source/libs/sync/src/syncAppendEntries.c
@@ -477,6 +477,13 @@ static int32_t syncNodeDoMakeLogSame(SSyncNode* ths, SyncIndex FromIndex) {
 static int32_t syncNodePreCommit(SSyncNode* ths, SSyncRaftEntry* pEntry) {
  SRpcMsg rpcMsg;
  syncEntry2OriginalRpc(pEntry, &rpcMsg);
+
+  // leader transfer
+  if (pEntry->originalRpcType == TDMT_SYNC_LEADER_TRANSFER) {
+    int32_t code = syncDoLeaderTransfer(ths, &rpcMsg, pEntry);
+    ASSERT(code == 0);
+  }
+
  if (ths->pFsm != NULL) {
    if (ths->pFsm->FpPreCommitCb != NULL && syncUtilUserPreCommit(pEntry->originalRpcType)) {
      SFsmCbMeta cbMeta = {0};

--- a/source/libs/sync/src/syncMain.c
+++ b/source/libs/sync/src/syncMain.c
@@ -1853,8 +1853,8 @@ void syncNodeDoConfigChange(SSyncNode* pSyncNode, SSyncCfg* pNewConfig, SyncInde
      syncNodeBecomeLeader(pSyncNode, tmpbuf);

      // Raft 3.6.2 Committing entries from previous terms
-      syncNodeReplicate(pSyncNode);
      syncNodeAppendNoop(pSyncNode);
+      syncNodeReplicate(pSyncNode);
      syncMaybeAdvanceCommitIndex(pSyncNode);

    } else {
@@ -2029,8 +2029,8 @@ void syncNodeCandidate2Leader(SSyncNode* pSyncNode) {
  syncNodeLog2("==state change syncNodeCandidate2Leader==", pSyncNode);

  // Raft 3.6.2 Committing entries from previous terms
-  syncNodeReplicate(pSyncNode);
  syncNodeAppendNoop(pSyncNode);
+  syncNodeReplicate(pSyncNode);
  syncMaybeAdvanceCommitIndex(pSyncNode);
 }

@@ -2598,9 +2598,13 @@ const char* syncStr(ESyncState state) {
  }
 }

-static int32_t syncDoLeaderTransfer(SSyncNode* ths, SRpcMsg* pRpcMsg, SSyncRaftEntry* pEntry) {
+int32_t syncDoLeaderTransfer(SSyncNode* ths, SRpcMsg* pRpcMsg, SSyncRaftEntry* pEntry) {
  SyncLeaderTransfer* pSyncLeaderTransfer = syncLeaderTransferFromRpcMsg2(pRpcMsg);

+  if (ths->state != TAOS_SYNC_STATE_FOLLOWER) {
+    syncNodeEventLog(ths, "I am not a follower, cannot do leader transfer");
+    return 0;
+  }
  syncNodeEventLog(ths, "do leader transfer");

  bool sameId = syncUtilSameId(&(pSyncLeaderTransfer->newLeaderId), &(ths->myRaftId));
@@ -2811,11 +2815,14 @@ int32_t syncNodeCommit(SSyncNode* ths, SyncIndex beginIndex, SyncIndex endIndex,
          ASSERT(code == 0);
        }

+#if 0
+        // execute in pre-commit
        // leader transfer
        if (pEntry->originalRpcType == TDMT_SYNC_LEADER_TRANSFER) {
          code = syncDoLeaderTransfer(ths, &rpcMsg, pEntry);
          ASSERT(code == 0);
        }
+#endif

        // restore finish
        // if only snapshot, a noop entry will be appended, so syncLogLastIndex is always ok

--- a/source/libs/transport/src/transCli.c
+++ b/source/libs/transport/src/transCli.c
@@ -1042,7 +1042,7 @@ static void cliSchedMsgToNextNode(SCliMsg* pMsg, SCliThrd* pThrd) {
  STraceId* trace = &pMsg->msg.info.traceId;
  char      tbuf[256] = {0};
  EPSET_DEBUG_STR(&pCtx->epSet, tbuf);
-  tGTrace("%s retry on next node, use %s, retryCnt:%d, limit:%d", transLabel(pThrd->pTransInst), tbuf,
+  tGDebug("%s retry on next node, use %s, retryCnt:%d, limit:%d", transLabel(pThrd->pTransInst), tbuf,
          pCtx->retryCnt + 1, pCtx->retryLimit);

  STaskArg* arg = taosMemoryMalloc(sizeof(STaskArg));
@@ -1134,11 +1134,11 @@ int cliAppCb(SCliConn* pConn, STransMsg* pResp, SCliMsg* pMsg) {
  if (hasEpSet) {
    char tbuf[256] = {0};
    EPSET_DEBUG_STR(&pCtx->epSet, tbuf);
-    tGTrace("%s conn %p extract epset from msg", CONN_GET_INST_LABEL(pConn), pConn);
+    tGDebug("%s conn %p extract epset from msg", CONN_GET_INST_LABEL(pConn), pConn);
  }

  if (pCtx->pSem != NULL) {
-    tGTrace("%s conn %p(sync) handle resp", CONN_GET_INST_LABEL(pConn), pConn);
+    tGDebug("%s conn %p(sync) handle resp", CONN_GET_INST_LABEL(pConn), pConn);
    if (pCtx->pRsp == NULL) {
      tGTrace("%s conn %p(sync) failed to resp, ignore", CONN_GET_INST_LABEL(pConn), pConn);
    } else {
@@ -1147,7 +1147,7 @@ int cliAppCb(SCliConn* pConn, STransMsg* pResp, SCliMsg* pMsg) {
    tsem_post(pCtx->pSem);
    pCtx->pRsp = NULL;
  } else {
-    tGTrace("%s conn %p handle resp", CONN_GET_INST_LABEL(pConn), pConn);
+    tGDebug("%s conn %p handle resp", CONN_GET_INST_LABEL(pConn), pConn);
    if (retry == false && hasEpSet == true) {
      pTransInst->cfp(pTransInst->parent, pResp, &pCtx->epSet);
    } else {
@@ -1257,7 +1257,7 @@ void transSendRequest(void* shandle, const SEpSet* pEpSet, STransMsg* pReq, STra
  cliMsg->refId = (int64_t)shandle;

  STraceId* trace = &pReq->info.traceId;
-  tGTrace("%s send request at thread:%08" PRId64 ", dst:%s:%d, app:%p", transLabel(pTransInst), pThrd->pid,
+  tGDebug("%s send request at thread:%08" PRId64 ", dst:%s:%d, app:%p", transLabel(pTransInst), pThrd->pid,
          EPSET_GET_INUSE_IP(&pCtx->epSet), EPSET_GET_INUSE_PORT(&pCtx->epSet), pReq->info.ahandle);
  ASSERT(transAsyncSend(pThrd->asyncPool, &(cliMsg->q)) == 0);
  transReleaseExHandle(transGetInstMgt(), (int64_t)shandle);
@@ -1297,7 +1297,7 @@ void transSendRecv(void* shandle, const SEpSet* pEpSet, STransMsg* pReq, STransM
  cliMsg->refId = (int64_t)shandle;

  STraceId* trace = &pReq->info.traceId;
-  tGTrace("%s send request at thread:%08" PRId64 ", dst:%s:%d, app:%p", transLabel(pTransInst), pThrd->pid,
+  tGDebug("%s send request at thread:%08" PRId64 ", dst:%s:%d, app:%p", transLabel(pTransInst), pThrd->pid,
          EPSET_GET_INUSE_IP(&pCtx->epSet), EPSET_GET_INUSE_PORT(&pCtx->epSet), pReq->info.ahandle);

  transAsyncSend(pThrd->asyncPool, &(cliMsg->q));

--- a/source/libs/transport/src/transSvr.c
+++ b/source/libs/transport/src/transSvr.c
@@ -1020,7 +1020,7 @@ void transRefSrvHandle(void* handle) {
    return;
  }
  int ref = T_REF_INC((SSvrConn*)handle);
-  tDebug("conn %p ref count:%d", handle, ref);
+  tTrace("conn %p ref count:%d", handle, ref);
 }

 void transUnrefSrvHandle(void* handle) {
@@ -1028,7 +1028,7 @@ void transUnrefSrvHandle(void* handle) {
    return;
  }
  int ref = T_REF_DEC((SSvrConn*)handle);
-  tDebug("conn %p ref count:%d", handle, ref);
+  tTrace("conn %p ref count:%d", handle, ref);
  if (ref == 0) {
    destroyConn((SSvrConn*)handle, true);
  }

--- a/source/util/src/terror.c
+++ b/source/util/src/terror.c
@@ -78,6 +78,7 @@ TAOS_DEFINE_ERROR(TSDB_CODE_INVALID_TIMESTAMP,            "Invalid timestamp for
 TAOS_DEFINE_ERROR(TSDB_CODE_MSG_DECODE_ERROR,             "Msg decode error")
 TAOS_DEFINE_ERROR(TSDB_CODE_NO_AVAIL_DISK,                "No available disk")
 TAOS_DEFINE_ERROR(TSDB_CODE_NOT_FOUND,                    "Not found")
+TAOS_DEFINE_ERROR(TSDB_CODE_TIME_UNSYNCED,                "Unsynced time")

 TAOS_DEFINE_ERROR(TSDB_CODE_REF_NO_MEMORY,                "Ref out of memory")
 TAOS_DEFINE_ERROR(TSDB_CODE_REF_FULL,                     "too many Ref Objs")
@@ -126,7 +127,7 @@ TAOS_DEFINE_ERROR(TSDB_CODE_TSC_NO_META_CACHED,           "No table meta cached"
 TAOS_DEFINE_ERROR(TSDB_CODE_TSC_DUP_COL_NAMES,            "duplicated column names")
 TAOS_DEFINE_ERROR(TSDB_CODE_TSC_INVALID_TAG_LENGTH,       "Invalid tag length")
 TAOS_DEFINE_ERROR(TSDB_CODE_TSC_INVALID_COLUMN_LENGTH,    "Invalid column length")
-TAOS_DEFINE_ERROR(TSDB_CODE_TSC_DUP_TAG_NAMES,            "duplicated tag names")
+TAOS_DEFINE_ERROR(TSDB_CODE_TSC_DUP_NAMES,                "duplicated names")
 TAOS_DEFINE_ERROR(TSDB_CODE_TSC_INVALID_JSON,             "Invalid JSON format")
 TAOS_DEFINE_ERROR(TSDB_CODE_TSC_INVALID_JSON_TYPE,        "Invalid JSON data type")
 TAOS_DEFINE_ERROR(TSDB_CODE_TSC_VALUE_OUT_OF_RANGE,       "Value out of range")
@@ -135,7 +136,7 @@ TAOS_DEFINE_ERROR(TSDB_CODE_TSC_STMT_API_ERROR,           "Stmt API usage error"
 TAOS_DEFINE_ERROR(TSDB_CODE_TSC_STMT_TBNAME_ERROR,        "Stmt table name not set")
 TAOS_DEFINE_ERROR(TSDB_CODE_TSC_STMT_CLAUSE_ERROR,        "not supported stmt clause")
 TAOS_DEFINE_ERROR(TSDB_CODE_TSC_QUERY_KILLED,             "Query killed")
-TAOS_DEFINE_ERROR(TSDB_CODE_TSC_NO_EXEC_NODE,             "No available execution node")
+TAOS_DEFINE_ERROR(TSDB_CODE_TSC_NO_EXEC_NODE,             "No available execution node in current query policy configuration")
 TAOS_DEFINE_ERROR(TSDB_CODE_TSC_NOT_STABLE_ERROR,         "Table is not a super table")

 // mnode-common
@@ -581,8 +582,9 @@ TAOS_DEFINE_ERROR(TSDB_CODE_UDF_INVALID_OUTPUT_TYPE,        "udf invalid output
 //schemaless
 TAOS_DEFINE_ERROR(TSDB_CODE_SML_INVALID_PROTOCOL_TYPE,      "Invalid line protocol type")
 TAOS_DEFINE_ERROR(TSDB_CODE_SML_INVALID_PRECISION_TYPE,     "Invalid timestamp precision type")
-TAOS_DEFINE_ERROR(TSDB_CODE_SML_INVALID_DATA,               "Invalid data type")
+TAOS_DEFINE_ERROR(TSDB_CODE_SML_INVALID_DATA,               "Invalid data format")
 TAOS_DEFINE_ERROR(TSDB_CODE_SML_INVALID_DB_CONF,            "Invalid schemaless db config")
+TAOS_DEFINE_ERROR(TSDB_CODE_SML_NOT_SAME_TYPE,              "Not the same type as before")

 //tsma
 TAOS_DEFINE_ERROR(TSDB_CODE_TSMA_ALREADY_EXIST,             "Tsma already exists")

--- a/tests/script/api/batchprepare.c
+++ b/tests/script/api/batchprepare.c
@@ -2685,6 +2685,8 @@ int main(int argc, char *argv[])

  runAll(taos);

+  taos_close(taos);
+
  return 0;
 }

--- a/tests/script/jenkins/basic.txt
+++ b/tests/script/jenkins/basic.txt
--- a/tests/script/tsim/compute/block_dist.sim
+++ b/tests/script/tsim/compute/block_dist.sim
 system sh/stop_dnodes.sh
 system sh/deploy.sh -n dnode1 -i 1
+system sh/cfg.sh -n dnode1 -c debugflag -v 131
 system sh/exec.sh -n dnode1 -s start
 sql connect

@@ -80,11 +81,11 @@ $nt = $ntPrefix . $i

 #sql select _block_dist() from $nt
 print show table distributed $nt
-sql show table distributed $nt
+sql_error show table distributed $nt

-if $rows == 0 then
-  return -1
-endi
+#if $rows == 0 then
+#  return -1
+#endi

 print ============== TD-5998
 sql_error select _block_dist() from (select * from $nt)

--- a/tests/script/tsim/issue/TD-2677.sim
+++ b/tests/script/tsim/issue/TD-2677.sim
-system sh/stop_dnodes.sh
-
-system sh/deploy.sh -n dnode1 -i 1
-system sh/deploy.sh -n dnode2 -i 2
-system sh/deploy.sh -n dnode3 -i 3
-
-system sh/cfg.sh -n dnode1 -c numOfMnodes -v 3
-system sh/cfg.sh -n dnode2 -c numOfMnodes -v 3
-system sh/cfg.sh -n dnode3 -c numOfMnodes -v 3
-
-system sh/cfg.sh -n dnode1 -c mnodeEqualVnodeNum -v 4
-system sh/cfg.sh -n dnode2 -c mnodeEqualVnodeNum -v 4
-system sh/cfg.sh -n dnode3 -c mnodeEqualVnodeNum -v 4
-
-print ============== deploy
-
-system sh/exec.sh -n dnode1 -s start 
-sql connect
-
-sql create dnode $hostname2
-sql create dnode $hostname3
-system sh/exec.sh -n dnode2 -s start 
-system sh/exec.sh -n dnode3 -s start 
-
-print  =============== step1
-$x = 0
-step1: 
-	$x = $x + 1
-	sleep 1000
-	if $x == 10 then
-		return -1
-	endi
-
-sql show dnodes
-print dnode1 $data4_1
-print dnode2 $data4_2
-print dnode3 $data4_3
-
-if $data4_1 != ready then
-  goto step1
-endi
-if $data4_2 != ready then
-  goto step1
-endi
-if $data4_3 != ready then
-  goto step1
-endi
-
-sql show mnodes
-$mnode1Role = $data2_1
-print mnode1Role $mnode1Role
-$mnode2Role = $data2_2
-print mnode2Role $mnode2Role
-$mnode3Role = $data2_3
-print mnode3Role $mnode3Role
-
-if $mnode1Role != master then
-  goto step1
-endi
-if $mnode2Role != slave then
-  goto step1
-endi
-if $mnode3Role != slave then
-  goto step1
-endi
-
-$x = 1
-show2:
-
-print  =============== step1
-sql create database d1 replica 2 quorum 2
-sql create table d1.t1 (ts timestamp, i int)
-sql_error create table d1.t1 (ts timestamp, i int)
-sql insert into d1.t1 values(now, 1)
-sql select * from d1.t1;
-if $rows != 1 then
-  return -1
-endi
-
-print  =============== step2
-sql create database d2 replica 3 quorum 2
-sql create table d2.t1 (ts timestamp, i int)
-sql_error create table d2.t1 (ts timestamp, i int)
-sql insert into d2.t1 values(now, 1)
-sql select * from d2.t1;
-if $rows != 1 then
-  return -1
-endi
-
-print  =============== step3
-sql       create database d4 replica 1 quorum 1
-sql_error create database d5 replica 1 quorum 2
-sql_error create database d6 replica 1 quorum 3
-sql_error create database d7 replica 1 quorum 4
-sql_error create database d8 replica 1 quorum 0
-sql       create database d9 replica 2 quorum 1
-sql       create database d10 replica 2 quorum 2
-sql_error create database d11 replica 2 quorum 3
-sql_error create database d12 replica 2 quorum 4
-sql_error create database d12 replica 2 quorum 0
-sql       create database d13 replica 3 quorum 1
-sql       create database d14 replica 3 quorum 2
-sql_error create database d15 replica 3 quorum 3
-sql_error create database d16 replica 3 quorum 4
-sql_error create database d17 replica 3 quorum 0
-
-
-system sh/exec.sh -n dnode1 -s stop  -x SIGINT
-system sh/exec.sh -n dnode2 -s stop  -x SIGINT
-system sh/exec.sh -n dnode3 -s stop  -x SIGINT
-system sh/exec.sh -n dnode4 -s stop  -x SIGINT
--- a/tests/script/tsim/issue/TD-2680.sim
+++ b/tests/script/tsim/issue/TD-2680.sim
--- a/tests/script/tsim/issue/TD-2713.sim
+++ b/tests/script/tsim/issue/TD-2713.sim
--- a/tests/script/tsim/issue/TD-3300.sim
+++ b/tests/script/tsim/issue/TD-3300.sim
--- a/tests/script/tsim/parser/alter.sim
+++ b/tests/script/tsim/parser/alter.sim
--- a/tests/script/tsim/parser/alter1.sim
+++ b/tests/script/tsim/parser/alter1.sim
 system sh/stop_dnodes.sh
-
 system sh/deploy.sh -n dnode1 -i 1
-system sh/cfg.sh -n dnode1 -c walLevel -v 1
 system sh/exec.sh -n dnode1 -s start
-sleep 100
 sql connect
-sql reset query cache

 $dbPrefix = alt1_db

@@ -87,9 +83,8 @@ if $data13 != NULL then
  return -1
 endi

-sleep 100
 print ================== insert values into table
-sql insert into car1 values (now, 1, 1,1 ) (now +1s, 2,2,2,) car2 values (now, 1,3,3)
+sql insert into car1 values (now, 1, 1,1 ) (now +1s, 2,2,2) car2 values (now, 1,3,3)

 sql select c1+speed from stb where c1 > 0
 if $rows != 3 then

--- a/tests/script/tsim/parser/alter__for_community_version.sim
+++ b/tests/script/tsim/parser/alter__for_community_version.sim
--- a/tests/script/tsim/parser/alter_column.sim
+++ b/tests/script/tsim/parser/alter_column.sim
--- a/tests/script/tsim/parser/alter_stable.sim
+++ b/tests/script/tsim/parser/alter_stable.sim
--- a/tests/script/tsim/parser/auto_create_tb.sim
+++ b/tests/script/tsim/parser/auto_create_tb.sim
--- a/tests/script/tsim/parser/auto_create_tb_drop_tb.sim
+++ b/tests/script/tsim/parser/auto_create_tb_drop_tb.sim
--- a/tests/script/tsim/parser/between_and.sim
+++ b/tests/script/tsim/parser/between_and.sim
--- a/tests/script/tsim/parser/binary_escapeCharacter.sim
+++ b/tests/script/tsim/parser/binary_escapeCharacter.sim
--- a/tests/script/tsim/parser/col_arithmetic_operation.sim
+++ b/tests/script/tsim/parser/col_arithmetic_operation.sim
--- a/tests/script/tsim/parser/col_arithmetic_query.sim
+++ b/tests/script/tsim/parser/col_arithmetic_query.sim
--- a/tests/script/tsim/parser/columnValue.sim
+++ b/tests/script/tsim/parser/columnValue.sim
--- a/tests/script/tsim/parser/commit.sim
+++ b/tests/script/tsim/parser/commit.sim
--- a/tests/script/tsim/parser/condition.sim
+++ b/tests/script/tsim/parser/condition.sim
--- a/tests/script/tsim/parser/constCol.sim
+++ b/tests/script/tsim/parser/constCol.sim
--- a/tests/script/tsim/parser/create_db.sim
+++ b/tests/script/tsim/parser/create_db.sim
--- a/tests/script/tsim/parser/create_db__for_community_version.sim
+++ b/tests/script/tsim/parser/create_db__for_community_version.sim
--- a/tests/script/tsim/parser/create_mt.sim
+++ b/tests/script/tsim/parser/create_mt.sim
--- a/tests/script/tsim/parser/create_tb.sim
+++ b/tests/script/tsim/parser/create_tb.sim
--- a/tests/script/tsim/parser/create_tb_with_tag_name.sim
+++ b/tests/script/tsim/parser/create_tb_with_tag_name.sim
--- a/tests/script/tsim/parser/dbtbnameValidate.sim
+++ b/tests/script/tsim/parser/dbtbnameValidate.sim
--- a/tests/script/tsim/parser/distinct.sim
+++ b/tests/script/tsim/parser/distinct.sim
--- a/tests/script/tsim/parser/fill.sim
+++ b/tests/script/tsim/parser/fill.sim
--- a/tests/script/tsim/parser/fill_stb.sim
+++ b/tests/script/tsim/parser/fill_stb.sim
--- a/tests/script/tsim/parser/fill_us.sim
+++ b/tests/script/tsim/parser/fill_us.sim
--- a/tests/script/tsim/parser/first_last.sim
+++ b/tests/script/tsim/parser/first_last.sim
--- a/tests/script/tsim/parser/fourArithmetic-basic.sim
+++ b/tests/script/tsim/parser/fourArithmetic-basic.sim
--- a/tests/script/tsim/parser/function.sim
+++ b/tests/script/tsim/parser/function.sim
--- a/tests/script/tsim/parser/groupby-basic.sim
+++ b/tests/script/tsim/parser/groupby-basic.sim
--- a/tests/script/tsim/parser/groupby.sim
+++ b/tests/script/tsim/parser/groupby.sim
--- a/tests/script/tsim/parser/having.sim
+++ b/tests/script/tsim/parser/having.sim
--- a/tests/script/tsim/parser/having_child.sim
+++ b/tests/script/tsim/parser/having_child.sim
--- a/tests/script/tsim/parser/import.sim
+++ b/tests/script/tsim/parser/import.sim
--- a/tests/script/tsim/parser/import_commit1.sim
+++ b/tests/script/tsim/parser/import_commit1.sim
--- a/tests/script/tsim/parser/import_commit2.sim
+++ b/tests/script/tsim/parser/import_commit2.sim
--- a/tests/script/tsim/parser/import_commit3.sim
+++ b/tests/script/tsim/parser/import_commit3.sim
--- a/tests/script/tsim/parser/import_file.sim
+++ b/tests/script/tsim/parser/import_file.sim
--- a/tests/script/tsim/parser/insert_multiTbl.sim
+++ b/tests/script/tsim/parser/insert_multiTbl.sim
--- a/tests/script/tsim/parser/insert_tb.sim
+++ b/tests/script/tsim/parser/insert_tb.sim
--- a/tests/script/tsim/parser/interp.sim
+++ b/tests/script/tsim/parser/interp.sim
--- a/tests/script/tsim/parser/join.sim
+++ b/tests/script/tsim/parser/join.sim
--- a/tests/script/tsim/parser/join_manyblocks.sim
+++ b/tests/script/tsim/parser/join_manyblocks.sim
--- a/tests/script/tsim/parser/join_multitables.sim
+++ b/tests/script/tsim/parser/join_multitables.sim
--- a/tests/script/tsim/parser/join_multivnode.sim
+++ b/tests/script/tsim/parser/join_multivnode.sim
--- a/tests/script/tsim/parser/last_cache.sim
+++ b/tests/script/tsim/parser/last_cache.sim
--- a/tests/script/tsim/parser/last_groupby.sim
+++ b/tests/script/tsim/parser/last_groupby.sim
--- a/tests/script/tsim/parser/lastrow.sim
+++ b/tests/script/tsim/parser/lastrow.sim
--- a/tests/script/tsim/parser/like.sim
+++ b/tests/script/tsim/parser/like.sim
--- a/tests/script/tsim/parser/limit.sim
+++ b/tests/script/tsim/parser/limit.sim
--- a/tests/script/tsim/parser/limit1.sim
+++ b/tests/script/tsim/parser/limit1.sim
--- a/tests/script/tsim/parser/limit1_tblocks100.sim
+++ b/tests/script/tsim/parser/limit1_tblocks100.sim
--- a/tests/script/tsim/parser/limit2.sim
+++ b/tests/script/tsim/parser/limit2.sim
--- a/tests/script/tsim/parser/limit2_tblocks100.sim
+++ b/tests/script/tsim/parser/limit2_tblocks100.sim
--- a/tests/script/tsim/parser/line_insert.sim
+++ b/tests/script/tsim/parser/line_insert.sim
--- a/tests/script/tsim/parser/mixed_blocks.sim
+++ b/tests/script/tsim/parser/mixed_blocks.sim
--- a/tests/script/tsim/parser/nchar.sim
+++ b/tests/script/tsim/parser/nchar.sim
--- a/tests/script/tsim/parser/nestquery.sim
+++ b/tests/script/tsim/parser/nestquery.sim
--- a/tests/script/tsim/parser/null_char.sim
+++ b/tests/script/tsim/parser/null_char.sim
--- a/tests/script/tsim/parser/precision_ns.sim
+++ b/tests/script/tsim/parser/precision_ns.sim
--- a/tests/script/tsim/parser/projection_limit_offset.sim
+++ b/tests/script/tsim/parser/projection_limit_offset.sim
--- a/tests/script/tsim/parser/regex.sim
+++ b/tests/script/tsim/parser/regex.sim
--- a/tests/script/tsim/parser/repeatAlter.sim
+++ b/tests/script/tsim/parser/repeatAlter.sim
--- a/tests/script/tsim/parser/selectResNum.sim
+++ b/tests/script/tsim/parser/selectResNum.sim
--- a/tests/script/tsim/parser/select_across_vnodes.sim
+++ b/tests/script/tsim/parser/select_across_vnodes.sim
--- a/tests/script/tsim/parser/select_distinct_tag.sim
+++ b/tests/script/tsim/parser/select_distinct_tag.sim
--- a/tests/script/tsim/parser/select_from_cache_disk.sim
+++ b/tests/script/tsim/parser/select_from_cache_disk.sim
--- a/tests/script/tsim/parser/select_with_tags.sim
+++ b/tests/script/tsim/parser/select_with_tags.sim
--- a/tests/script/tsim/parser/set_tag_vals.sim
+++ b/tests/script/tsim/parser/set_tag_vals.sim
--- a/tests/script/tsim/parser/single_row_in_tb.sim
+++ b/tests/script/tsim/parser/single_row_in_tb.sim
--- a/tests/script/tsim/parser/sliding.sim
+++ b/tests/script/tsim/parser/sliding.sim
--- a/tests/script/tsim/parser/slimit.sim
+++ b/tests/script/tsim/parser/slimit.sim
--- a/tests/script/tsim/parser/slimit1.sim
+++ b/tests/script/tsim/parser/slimit1.sim
--- a/tests/script/tsim/parser/stableOp.sim
+++ b/tests/script/tsim/parser/stableOp.sim
--- a/tests/script/tsim/parser/tags_dynamically_specifiy.sim
+++ b/tests/script/tsim/parser/tags_dynamically_specifiy.sim
--- a/tests/script/tsim/parser/tags_filter.sim
+++ b/tests/script/tsim/parser/tags_filter.sim
--- a/tests/script/tsim/parser/tbnameIn.sim
+++ b/tests/script/tsim/parser/tbnameIn.sim
--- a/tests/script/tsim/parser/timestamp.sim
+++ b/tests/script/tsim/parser/timestamp.sim
--- a/tests/script/tsim/parser/top_groupby.sim
+++ b/tests/script/tsim/parser/top_groupby.sim
--- a/tests/script/tsim/parser/topbot.sim
+++ b/tests/script/tsim/parser/topbot.sim
--- a/tests/script/tsim/parser/union.sim
+++ b/tests/script/tsim/parser/union.sim
--- a/tests/script/tsim/parser/where.sim
+++ b/tests/script/tsim/parser/where.sim
--- a/tests/script/tsim/qnode/basic1.sim
+++ b/tests/script/tsim/qnode/basic1.sim
--- a/tests/script/tsim/snode/basic1.sim
+++ b/tests/script/tsim/snode/basic1.sim
--- a/tests/script/tsim/sync/3Replica1VgElect.sim
+++ b/tests/script/tsim/sync/3Replica1VgElect.sim
--- a/tests/script/tsim/sync/3Replica5VgElect.sim
+++ b/tests/script/tsim/sync/3Replica5VgElect.sim
--- a/tests/script/tsim/sync/3Replica5VgElect3mnode.sim
+++ b/tests/script/tsim/sync/3Replica5VgElect3mnode.sim
--- a/tests/script/tsim/sync/3Replica5VgElect3mnodedrop.sim
+++ b/tests/script/tsim/sync/3Replica5VgElect3mnodedrop.sim
--- a/tests/script/tsim/sync/electTest.sim
+++ b/tests/script/tsim/sync/electTest.sim
--- a/tests/script/tsim/sync/mnodeLeaderTransfer.sim
+++ b/tests/script/tsim/sync/mnodeLeaderTransfer.sim
--- a/tests/script/tsim/sync/oneReplica1VgElect.sim
+++ b/tests/script/tsim/sync/oneReplica1VgElect.sim
--- a/tests/script/tsim/sync/oneReplica1VgElectWithInsert.sim
+++ b/tests/script/tsim/sync/oneReplica1VgElectWithInsert.sim
--- a/tests/script/tsim/sync/oneReplica5VgElect.sim
+++ b/tests/script/tsim/sync/oneReplica5VgElect.sim
--- a/tests/script/tsim/sync/threeReplica1VgElect.sim
+++ b/tests/script/tsim/sync/threeReplica1VgElect.sim
--- a/tests/script/tsim/sync/threeReplica1VgElectWihtInsert.sim
+++ b/tests/script/tsim/sync/threeReplica1VgElectWihtInsert.sim
--- a/tests/script/tsim/sync/vnode-insert.sim
+++ b/tests/script/tsim/sync/vnode-insert.sim
--- a/tests/script/tsim/sync/vnodeLeaderTransfer.sim
+++ b/tests/script/tsim/sync/vnodeLeaderTransfer.sim
--- a/tests/script/tsim/sync/vnodesnapshot-test.sim
+++ b/tests/script/tsim/sync/vnodesnapshot-test.sim
--- a/tests/script/tsim/sync/vnodesnapshot.sim
+++ b/tests/script/tsim/sync/vnodesnapshot.sim
--- a/tests/script/tsim/table/back_insert.sim
+++ b/tests/script/tsim/table/back_insert.sim
--- a/tests/script/tsim/table/delete_reuse1.sim
+++ b/tests/script/tsim/table/delete_reuse1.sim
--- a/tests/script/tsim/table/delete_reuse2.sim
+++ b/tests/script/tsim/table/delete_reuse2.sim
--- a/tests/script/tsim/table/delete_writing.sim
+++ b/tests/script/tsim/table/delete_writing.sim
--- a/tests/script/tsim/valgrind/basic1.sim
+++ b/tests/script/tsim/valgrind/basic1.sim
--- a/tests/script/tsim/valgrind/checkError3.sim
+++ b/tests/script/tsim/valgrind/checkError3.sim
--- a/tests/system-test/2-query/csum.py
+++ b/tests/system-test/2-query/csum.py
--- a/tests/system-test/2-query/json_tag_large_tables.py
+++ b/tests/system-test/2-query/json_tag_large_tables.py
--- a/tests/system-test/2-query/last_row.py
+++ b/tests/system-test/2-query/last_row.py
--- a/tests/system-test/2-query/max_partition.py
+++ b/tests/system-test/2-query/max_partition.py
--- a/tests/system-test/2-query/sample.py
+++ b/tests/system-test/2-query/sample.py
--- a/tests/system-test/2-query/tsbsQuery.py
+++ b/tests/system-test/2-query/tsbsQuery.py
--- a/tools/CMakeLists.txt
+++ b/tools/CMakeLists.txt
--- a/taos-tools @ d807c3ff
+++ b/taos-tools @ d807c3ff
--- a/taosws-rs @ c5fded26
+++ b/taosws-rs @ c5fded26